|
Building production-ready Generative AI apps on AWS—Lessons from bringing AI into ZEPIC’s platform

Hey tech folks!

Like many of you, we at ZEPIC have been exploring how to architect AI into the foundation of our product. As the co-founder and head of engineering, I've had the privilege of leading this transformation. Having worked with AI technologies for over a decade, well before the current Gen-AI revolution, I've witnessed how the AI landscape has evolved and the massive opportunity it now presents.

Along the way of building our AI engine, we’ve made our share of mistakes to understand what truly works in production. I wanted to share some practical insights and lessons from building Gen AI applications on AWS. If you're considering a similar path, here’s what we've discovered— Read along and I hope it helps.

Core Technology Decisions

Programming Language and Frameworks

The choice of Python was a no-brainer due to the wide range of libraries available for machine learning and generative AI purposes. For frameworks, we evaluated several options, including LlamaIndex, Langchain, and Autogen. For web server implementation, FastAPI proved to be our best choice.

AWS Offerings

Before settling on AWS Bedrock, we had evaluated several AI providers including OpenAI, Anthropic, and Cohere. We went with AWS Bedrock for its wide range of models, high availability, and robust guardrails. Here is why I believe that AWS's offerings are better suited for building generative AI applications—

1. Bedrock: Bedrock's strength lies in its diverse model selection, allowing us to choose the right model for specific use cases. Some of the popular models include Claude models for complex reasoning, LLama models for general-purpose tasks, Cohere for embeddings, and Jurassic for specialized applications. We loved Bedrock's flexibility, where you can even import custom models from HuggingFace to suit your specific needs.

Source

2. SageMaker: For custom model deployment and training

3. Guardrails: For security and content control

Choosing the Right Model

Start with a smaller model like LLama 3.1 (available in 8B, 70B, 405B versions). While Claude Sonnet is the most intelligent model available in the AWS marketplace, remember that smaller models mean less latency and lower costs. 

Production Challenges and Solutions

When selecting an AI provider, we found three factors to be critical—the range of available models, security features, and system availability. Here is how we addressed each of these in our production environment:

Security with AWS Guardrails

Security sits at the heart of ZEPIC's AI architecture. We really wanted to ensure that our ZENIE AI is secure. So, we built an app firewall with AWS Guardrails, which masks PII data, prevents prompt injection, enables response grounding, filters profanity, and blocks violence-related content.

High Availability

AWS supports cross-region inference for certain models, but this is limited. To overcome this, we built our custom model route algorithm which distributes the load across multiple regions. This includes deploying Claude instances in both US and Europe, implementing round-robin traffic routing between regions. For enterprises with predictable, high-volume needs, AWS offers provisioned throughput, but comes with a hefty price tag—one model unit for Sonnet costs $29,262.

Output Control and Optimization

Most of the time, we want the model to return structured output. We use prompt techniques by providing examples, format instructions, and adding fallback mechanisms to make the model output as deterministic as possible.

Prompt Caching

If you're using Anthropic models, try implementing prompt caching to improve performance and save money.

Model Deployment

For high-volume traffic, especially if you're a dedicated GenAI company, consider deploying your own models using SageMaker. For example, for image generation, you can deploy Stable Diffusion.

How to build advanced AI capabilities?

Simple prompting doesn't always meet complex implementation needs. When working with custom implementations, particularly where specific output formats are important, you might need to consider fine-tuning.

Fine-Tuning

While most use cases can be solved with good prompt engineering, some scenarios require fine-tuning your model. AWS offers two ways to implement fine-tuning:

  • Full Fine-tuning: While powerful, this requires substantial computational resources and a large dataset.
  • Performance Efficient Fine-tuning (PEFT/LORA): A more resource-efficient approach available for certain models on AWS, reducing computational costs while maintaining performance.
A simple flow of fine-tuning a model(Source)

RAG (Retrieval Augmented Generation)
For some use cases, we wanted to build AI capabilities on top of our own data—be it documents, knowledge base articles, or other content. This is where RAG (Retrieval Augmented Generation) comes in. As the name suggests, it augments the model's responses by retrieving relevant information from our data sources.

These were our key considerations for RAG implementation:

  • Choose the right embedding model (we use Cohere)
  • Select the appropriate vector database (we evaluated Milvus, Weaviate, ES)
  • Implement an effective chunking strategy
  • Focus on retrieval and reranking

A typical RAG implementation(Source)
Our Production Journey

Organizational Structure

We established a dedicated team focused on building AI solutions for the long term. We are also educating our team members about GenAI and providing access to tools like Amazon Q and GitHub Copilot to developers and QA.

Current Implementation

Our production journey has been focused and iterative, with our entire infrastructure managed through Terraform CDK to ensure consistency and reproducibility across deployments. Implementation time has been surprisingly reasonable, particularly for API-based integrations, allowing us to move faster while maintaining reliability.

The decision to build everything with API-first architecture has been the right choice for us. Our trial customers have responded positively, and our approach has gained recognition—ZEPIC was ranked as a top AI marketing tool on Good AI Tools immediately after launch, and continues to be featured on leading AI directories including The Rundown AI, Futurepedia, AI Tools Inc, and The Neuron.

Looking back at our journey, one thing is clear: Generative AI isn't just another hype—AI can now understand text, multiple languages, audio, video, images, and more. The key is starting small and growing based on actual needs. Remember—before jumping into fine-tuning or complex implementations, do what you can with prompt engineering. Focus on security and availability, and let real user needs guide your expansion.

Hope these pointers help you out - Good luck with your AI journey! If you have any thoughts, please feel free to share.

Recent blogs post