Visa requirements are one of those things that seem simple until they aren’t. Imagine you hold a Brazilian passport, live in Canada on a work permit, and want to visit the Schengen area for two weeks. Suddenly the questions compound: Does your Canadian residency status qualify you for any waiver programs? What are the re-entry requirements to come back to Canada? Are there any recent policy changes, like the EU’s upcoming ETIAS system, you should know about? Each of these lives on a different government website, written in a different style of legalese, and none of them talk to each other.
That frustration led me to build Visa Advisor, an LLM-powered service that takes a traveler’s complete document profile (passports, residency documents, visa statuses) and a destination, then returns clear, structured guidance. Not a chatbot. Not a generic FAQ. A purpose-built analysis engine that searches the web for current policies, runs them through an LLM pipeline, and delivers actionable results.
In this post, I’ll walk through the architecture, the design decisions that shaped it, how I approach evaluation and feedback, and what I learned along the way. You can try the service yourself at getvisaadvisor.com.
Architecture at a Glance
The system follows a layered architecture: a lightweight frontend, a fully async Python backend, and a set of external services that do the heavy lifting.
Tech stack:
| Component | Technology |
|---|---|
| Backend | Python 3.9+, FastAPI, Gunicorn/Uvicorn |
| LLM | Anthropic Claude (Haiku for speed, Sonnet for depth) |
| Web Search | Tavily API |
| Cache & Database | Supabase (PostgreSQL) |
| Frontend | Vanilla HTML/JS/CSS |
| Deployment | Docker, Fly.io, GitHub Actions |
| Observability | Langfuse |
| Evaluation | pytest, DeepEval (LLM-as-a-judge) |
The backend is organized around four core services, each with a single responsibility:
- VisaCheckerService: the orchestrator that coordinates the entire flow: check cache, run web searches, invoke the LLM, store results.
- WebSearchService: handles parallel web searches via the Tavily API.
- LLMService: manages all interactions with Claude, including prompt construction, streaming, and response validation.
- CacheService: reads from and writes to Supabase, turning expensive 10-30 second operations into instant cache hits.
The design principle was straightforward: keep each service stateless and composable. The orchestrator calls them in sequence (or parallel where possible), and any service can fail gracefully without taking down the others.
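As a minimal sketch of that orchestration, here is the cache-then-search-then-analyze sequence with the services injected as dependencies. The method signatures are illustrative, not the production API:

```python
import asyncio

# Sketch of the orchestrator described above. The service names mirror
# the post; the method signatures and profile shape are illustrative.
class VisaCheckerService:
    def __init__(self, cache, search, llm):
        self.cache, self.search, self.llm = cache, search, llm

    async def check(self, profile: dict) -> dict:
        cached = await self.cache.get(profile)
        if cached is not None:          # cache hit: skip the expensive path
            return cached
        results = await self.search.run_all(profile)   # parallel web searches
        analysis = await self.llm.analyze(profile, results)
        await self.cache.put(profile, analysis)        # store for next time
        return analysis
```

Because each service only receives its inputs and returns its outputs, any of them can be swapped or stubbed out in tests without touching the orchestrator.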
How a Request Flows Through the System
Here’s what happens when a user submits a visa check.
```mermaid
flowchart TD
    UserInput[User Input] --> ParseInput[Step 1: Parse Input via Claude Haiku]
    ParseInput --> CheckCache[Step 2: Check Cache in Supabase]
    CheckCache -->|Cache Hit| ReturnCached[Return Cached Response]
    CheckCache -->|Cache Miss| WebSearch[Step 3: Parallel Web Searches via Tavily]
    WebSearch --> LLMAnalysis[Step 4: LLM Analysis via Claude]
    LLMAnalysis --> Reflection{Step 5: Reflection Enabled?}
    Reflection -->|Yes| Reflector[Review via Claude Sonnet]
    Reflection -->|No| CacheReturn[Step 6: Cache and Return]
    Reflector --> CacheReturn
```
Step 1: Parse the Input. Users can either fill out a structured form or type a natural language query like “I’m a Nigerian citizen with Canadian permanent residency, traveling to Japan next month.” For free-form input, I used Claude Haiku specifically for its speed. It extracts structured fields (citizenship, residency, destination, dates) in milliseconds at temperature 0.0 for deterministic parsing, keeping the parsing step fast and cost-effective. If the parser’s confidence is below a threshold or critical fields are missing, the UI shows a preview for the user to confirm before proceeding.
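The confirmation gate after parsing can be sketched as a small predicate. The field names and the 0.8 threshold here are illustrative assumptions, not the production values:

```python
# Sketch of the post-parse confidence gate described above. REQUIRED_FIELDS
# and the 0.8 threshold are illustrative, not the production values.
REQUIRED_FIELDS = ("citizenship", "destination")
CONFIDENCE_THRESHOLD = 0.8

def needs_confirmation(parsed: dict) -> bool:
    """Return True when the UI should show a preview for the user to confirm."""
    if parsed.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
        return True
    # Any missing or empty critical field also triggers the preview.
    return any(not parsed.get(field) for field in REQUIRED_FIELDS)
```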
Step 2: Check the Cache. Before doing any expensive work, the service hashes the request parameters using SHA-256 and checks Supabase for a cached response. Cache entries have a configurable TTL (default 30 days) and a version number. Bumping the version in config invalidates all existing entries, which is useful when prompts or search strategies change. One subtle optimization: if a user requests a standard analysis but an enhanced result already exists in the cache, the system returns the better version at no extra cost.
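The cache key construction can be sketched in a few lines: normalize the parameters into a canonical JSON string, prefix the version, and hash. The version constant here is an illustrative placeholder:

```python
import hashlib
import json

CACHE_VERSION = 3  # illustrative; bumping this invalidates all existing entries

def cache_key(params: dict) -> str:
    """Sorted, normalized, versioned cache key, as described above."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"v{CACHE_VERSION}:{canonical}".encode()).hexdigest()
```

Sorting the keys means two requests that differ only in field order map to the same entry, and the embedded version makes invalidation a one-line config change.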
Step 3: Parallel Web Searches. This is where the async architecture pays off. For a request with N passports, M residency documents, and P visa statuses, the system fires off 2N + M + P + 1 searches concurrently using asyncio.gather(): visa requirements per passport (N), visa waiver programs per passport (N), residency travel benefits per document (M), re-entry requirements per visa status (P), and current policy updates (1). A dual citizen with a permanent residency document and a work permit status triggers seven parallel searches (2·2 + 1 + 1 + 1). Running these in parallel instead of sequentially cuts the search phase from 30+ seconds to under 5 seconds.
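The fan-out itself is a single asyncio.gather() call. In this sketch, search_one is a stand-in for the real Tavily HTTP request:

```python
import asyncio

# Sketch of the concurrent search fan-out. search_one stands in for the
# real Tavily API call; its return shape is illustrative.
async def search_one(query: str) -> dict:
    await asyncio.sleep(0)  # placeholder for the actual HTTP round-trip
    return {"query": query, "results": []}

async def run_searches(queries: list[str]) -> list[dict]:
    # return_exceptions=True lets one failed search degrade gracefully
    # instead of cancelling the whole batch.
    results = await asyncio.gather(
        *(search_one(q) for q in queries), return_exceptions=True
    )
    return [r for r in results if not isinstance(r, BaseException)]
```

The return_exceptions flag is what makes a single failed search a partial degradation rather than a total failure.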
Step 4: LLM Analysis. All search results are formatted into a structured prompt and sent to Claude. The system prompt instructs the model to determine if a visa is required, recommend which passport/document combination to use, flag critical gotchas (overstay risks, transit visa requirements), generate a preparation checklist with deadlines, provide re-entry requirements, and include links to official resources. The response is validated against a Pydantic schema to ensure it contains all required fields.
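The production code validates the response with a Pydantic schema; as a stdlib stand-in, the same required-field check looks like this. The field names are illustrative, inferred from the sections listed above:

```python
# Simplified stand-in for the Pydantic schema check described above.
# Field names are illustrative, based on the response sections in the post.
REQUIRED_KEYS = {
    "visa_required", "summary", "recommendation",
    "gotchas", "checklist", "reentry_requirements", "official_links",
}

def validate_analysis(payload: dict) -> list[str]:
    """Return the sorted list of missing required fields (empty means valid)."""
    return sorted(REQUIRED_KEYS - payload.keys())
```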
Step 5: Optional Reflection. For complex cases, the service supports a “reflection” mode where a second, more capable model reviews and improves the initial analysis. More on this below.
Step 6: Cache and Return. The validated response is cached in Supabase and returned to the user with a detailed timing breakdown.
Three LLM Roles, Not One
Rather than a single “do everything” prompt, the system uses three distinct LLM roles, each optimized for a specific task.
The Query Parser (Claude Haiku) handles natural language extraction, turning free-form input into structured data. This is a classic “LLM as a structured extractor” pattern. Temperature is set to 0.0 because we need reliable, consistent parsing, not creative interpretation.
The Analyzer (Claude Haiku) is the core of the system. It receives all the web search results along with the traveler’s document profile and produces a structured JSON determination. Using Haiku here was a deliberate cost/speed tradeoff. For a factual analysis task grounded in search results, Haiku performs remarkably well at a fraction of the cost and latency of larger models. The output schema is enforced through a detailed system prompt and validated with Pydantic on the backend.
The Reflector (Claude Sonnet) is where things get interesting. It implements what I call an analyze-then-reflect pattern, a lightweight alternative to full multi-agent systems. After the Analyzer produces its initial determination, the Reflector reviews the entire output, looking for factual errors, missing information, inconsistencies, and safety concerns. It then produces an improved version with corrections applied.
This is not a debate between agents. It’s a structured self-review: one fast model drafts, one more capable model critiques and refines. The key insight is that reviewing is easier than generating from scratch. The Reflector has a concrete artifact to evaluate rather than starting from a blank page. If the Reflector’s output fails schema validation, the system falls back gracefully to the original Analyzer response.
Users can toggle between standard (Analyzer only) and enhanced (Analyzer + Reflector) analysis. Enhanced takes longer and costs more, but catches edge cases that the Analyzer might miss, especially for complex scenarios like dual citizens with multiple visa statuses.
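The draft-review-fallback control flow described above reduces to a few lines. Here analyzer, reflector, and is_valid are stand-ins for the two Claude calls and the schema validation step:

```python
# Sketch of the analyze-then-reflect pattern with graceful fallback.
# analyzer and reflector stand in for the two Claude calls; is_valid
# stands in for schema validation of the reflector's output.
def analyze_with_reflection(analyzer, reflector, is_valid, request):
    draft = analyzer(request)
    try:
        improved = reflector(request, draft)   # second pass reviews the draft
    except Exception:
        return draft                           # reflector call failed: keep draft
    return improved if is_valid(improved) else draft
```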
Web Search as Grounding
One of the earliest and most consequential design decisions was to never rely on the LLM’s training data for visa requirements. Immigration rules change frequently. A policy that was accurate during training could be outdated by the time a user asks. In this domain, hallucination isn’t just unhelpful; it’s potentially harmful.
Instead, every request triggers real-time web searches via the Tavily API. Why Tavily? It provides a search API specifically designed for LLM applications, returning clean, extracted content rather than raw HTML, which reduces token usage and improves analysis quality.
How Queries Are Crafted
Search queries are programmatically constructed by combining the traveler’s profile with targeted search terms. For example, a Brazilian citizen traveling to Japan generates queries like "Brazilian passport visa requirements Japan" and "Brazilian citizen visa waiver Japan". Each residency document adds queries like "Canadian permanent resident travel benefits Japan", and each visa status adds re-entry requirement queries. A final query targets recent policy updates: "Japan visa policy changes 2025". This ensures the search coverage maps directly to the traveler’s specific situation rather than relying on a single generic query.
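The query builder above can be sketched as a pure function producing the 2N + M + P + 1 queries. The template phrasings are illustrative, modeled on the examples in the text:

```python
# Sketch of programmatic query construction: 2N + M + P + 1 queries,
# one per category described above. Template wording is illustrative.
def build_queries(passports, residency_docs, visa_statuses, destination, year=2025):
    queries = []
    for p in passports:                      # 2 queries per passport
        queries.append(f"{p} passport visa requirements {destination}")
        queries.append(f"{p} citizen visa waiver {destination}")
    for doc in residency_docs:               # 1 per residency document
        queries.append(f"{doc} travel benefits {destination}")
    for status in visa_statuses:             # 1 per visa status
        queries.append(f"{status} re-entry requirements")
    queries.append(f"{destination} visa policy changes {year}")  # always 1
    return queries
```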
Source Preferences
Not all search results carry equal weight. The system is designed to prioritize official and authoritative sources in a clear hierarchy:
- Government domains (.gov, .gc.ca, .gov.uk, etc.) and official immigration authority websites are treated as the most reliable sources.
- Embassy and consulate websites are the next tier, as they provide destination-specific guidance for travelers from particular countries.
- Established travel advisory sites (e.g., IATA TravelCentre, Timatic) serve as useful secondary references for cross-validation.
- Forums, blogs, and user-generated content are deprioritized, as they may contain outdated or anecdotal information.
Tavily’s domain filtering capabilities help steer searches toward these authoritative sources. The LLM’s system prompt further reinforces this hierarchy: prioritize government sources, cross-reference multiple results, and flag any contradictions between sources.
This “search then analyze” pattern effectively turns the LLM into a reasoning engine over fresh data rather than a knowledge retrieval system. The web search provides the facts; the LLM provides the synthesis.
Streaming for Better UX
A full visa analysis takes 10-30 seconds on a cache miss. That’s an eternity in web UX. Rather than showing a spinner for the entire duration, the service streams results to the frontend as they become available using Server-Sent Events (SSE) with NDJSON patches.
The difference streaming makes is significant. Without it, users stare at a loading indicator with no sense of progress, no idea whether the system is working or stuck. With streaming, the experience feels alive. Stage events communicate progress (“Searching for relevant sources,” “Analyzing results,” “Running enhanced analysis”), so users always know where they are in the pipeline. As the LLM generates its response, the summary text forms in real-time, warnings appear one by one, and the checklist builds out incrementally. Users can start reading and processing the results while the rest of the analysis is still being generated.
This creates a dramatically smoother experience. Perceived wait time drops because users are engaged with partial results instead of watching a spinner. It also builds trust: when you can see the system actively working through your specific situation, it feels more thorough and reliable than a response that simply appears all at once after a long pause.
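The NDJSON patch stream described above boils down to emitting one JSON object per line. The event shapes here are illustrative assumptions, not the production wire format:

```python
import json

# Sketch of the NDJSON stream sent over SSE: one JSON object per line,
# either a stage event or a partial-content patch. The event shapes are
# illustrative, not the production wire format.
def ndjson_stream(events):
    for event in events:
        yield json.dumps(event) + "\n"
```

Because each line is a complete JSON document, the frontend can parse every line the moment it arrives and apply it as an incremental patch to the UI state.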
Evaluation: The Non-Negotiable
Building an LLM-powered service without evaluation is like writing code without tests. It might work today, but you have no way to know if tomorrow’s prompt change broke something. For a service dealing with travel and immigration information, getting things wrong has real consequences. Someone could show up at an airport without the right documents.
I built a dedicated evaluation framework using DeepEval with an LLM-as-a-judge approach.
Four Weighted Metrics
The evaluation captures different dimensions of response quality:
Accuracy (35% weight): Is the visa determination correct? Are fees, processing times, and document requirements accurate? This carries the highest weight because getting the core determination wrong is the most consequential failure mode.
Completeness (25%): Does the response cover all sections: summary, recommendation, gotchas, checklist, re-entry info, and official links? A response that correctly says “visa required” but omits the application process is incomplete.
Clarity (20%): Is the response well-organized, actionable, and specific? Does it include concrete steps and deadlines rather than vague guidance?
Consistency (20%): Does the internal logic hold up? If visa_required is true, does the recommendation section describe an application process? Do timelines add up?
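The overall score is just the weighted sum of the four per-metric scores:

```python
# Weighted-score computation using the metric weights listed above.
WEIGHTS = {"accuracy": 0.35, "completeness": 0.25, "clarity": 0.20, "consistency": 0.20}

def weighted_score(scores: dict) -> float:
    """Combine per-metric scores (each in [0, 1]) into one overall score."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```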
Alternative Metrics Worth Considering
The four metrics above capture the core quality dimensions, but there are other metrics worth considering depending on the use case:
- Relevance: Are the answers directly relevant to the traveler’s specific query and situation, rather than providing generic visa information?
- Faithfulness / Groundedness: Does the response strictly adhere to the information found in search results, or does it introduce claims not supported by the retrieved sources? This is especially important for RAG-based systems where hallucination is a key risk.
- Timeliness: Are the referenced policies and requirements current? Does the response cite recent sources rather than outdated ones?
Alternative Evaluation Frameworks
DeepEval worked well for this project, but there are other solid options in the space:
- Ragas: A framework specifically designed for evaluating Retrieval-Augmented Generation (RAG) pipelines. It provides metrics like faithfulness, answer relevancy, and context precision out of the box, which map well to search-grounded applications.
- promptfoo: A CLI-based tool for testing and evaluating prompts across different models and configurations. Its strength is in comparative prompt testing, where you want to A/B test different prompt versions against a shared test suite.
- Human evaluation panels: For high-stakes domains like immigration, periodic human review of a sample of responses provides a ground truth that no automated metric can fully replace. This is more expensive and slower, but valuable as a calibration check.
Test Cases and Workflow
Test cases are defined in JSON with structured inputs and expected outputs, tagged for filtering: common, complex, dual_citizenship, visa_required, visa_free. This allows running targeted evaluation suites, for example all dual citizenship scenarios, or just the “visa free” cases.
Evaluations hit the live API (with caching disabled), run the response through all four metrics, compute a weighted score, and output detailed results. The pass threshold is 0.7 per metric with an overall pass rate of 80% required for the suite to succeed, making it suitable for CI/CD integration.
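The pass criteria above can be expressed as two small predicates, one per test case and one for the whole suite:

```python
# Sketch of the suite pass criteria described above: every metric must
# score at least 0.7 for a case to pass, and at least 80% of cases must
# pass for the suite to succeed.
PER_METRIC_THRESHOLD = 0.7
SUITE_PASS_RATE = 0.8

def case_passes(metric_scores: dict) -> bool:
    return all(v >= PER_METRIC_THRESHOLD for v in metric_scores.values())

def suite_passes(all_cases: list[dict]) -> bool:
    passed = sum(case_passes(c) for c in all_cases)
    return passed / len(all_cases) >= SUITE_PASS_RATE
```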
The evaluation framework made it possible to iterate on prompts with confidence. Change a system prompt, run the eval suite, and see exactly how accuracy, completeness, clarity, and consistency shifted. Without it, every prompt change was a leap of faith.
Feedback Loop
Evaluation catches issues at development time, but production quality depends on continuous feedback from real users. The service includes a built-in feedback mechanism: after receiving their results, users can rate the response and leave comments. Feedback is stored alongside the cached response in Supabase, creating a dataset that ties user satisfaction back to specific inputs, search results, and model outputs.
This feedback loop is essential. Edge cases that no test suite anticipates surface through real usage, and that data feeds back into improved prompts, better search strategies, and new test cases.
Observability and Debugging
LLM applications are notoriously hard to debug. When a response is wrong, the question is always: was it because the search returned bad results, or because the LLM misinterpreted them?
I integrated Langfuse for LLM tracing. Every request is traced with the full prompt sent to Claude, raw search results from Tavily, the LLM response before and after reflection, and token usage and latency breakdowns. This made it dramatically easier to diagnose issues and tune prompts. When something goes wrong, I can trace the full pipeline and pinpoint exactly where the analysis diverged.
Deployment
The service runs on Fly.io with a straightforward setup: a Docker container running Gunicorn with Uvicorn ASGI workers to serve the FastAPI app, the frontend served via Nginx, and GitHub Actions CI/CD with pushes to main deploying to production and release/* branches deploying to staging. Health checks run every 30 seconds, and auto-scaling goes all the way down to zero machines when idle.
The entire stack runs within Fly.io’s free tier for low-traffic usage, making it cost-effective to keep running as a live service. Going forward, I’m exploring hosting on DigitalOcean’s inference platform for cost optimization, particularly as traffic grows beyond the free tier limits.
Example: A Complex Query in Action
To make this concrete, here is a real query run through the system with Standard analysis mode.
Input (natural language):
“I hold both a Nigerian passport and a British passport. I’m currently living in Canada on a work permit. I want to travel to Japan for a 2-week vacation, departing from Toronto.”
This is a complex scenario: dual citizenship, a work permit in a third country, and questions around which passport to use, visa requirements, and re-entry to Canada.
What the system returned:
Determination: No Visa Required
Summary: The system compared both passport options side by side. It identified the British passport as the clear winner: visa-free entry to Japan for up to 90 days with no application required. It flagged that the Nigerian passport would require a visa application (eVISA or embassy visit, 2-3 working days). It also noted that the Canadian work permit does not itself grant visa exemptions for Japan, but that the traveler should use their actual passport (not the work permit document) for travel.
Recommendation: Travel on the British passport for immediate, hassle-free entry.
Key gotchas flagged (10 total, here are a few):
- Passport must be valid for entire stay, with at least 1 blank page
- Cannot switch visa types while in Japan (tourist to work/study)
- 90-day maximum stay, overstaying is a criminal offense
- New departure tax of 3,000 yen effective July 1, 2026
- Controlled substances (including some common prescriptions) are strictly prohibited
- Re-entry to Canada requires presenting both British passport and valid Canadian work permit
Checklist generated: 12 action items including verifying passport validity, confirming work permit status, arranging proof of return ticket, preparing proof of funds, and budgeting for the departure tax.
Re-entry requirements: Detailed guidance on returning to Canada as a work permit holder, including required documents and a warning that the work permit must remain valid throughout the trip.
Official sources cited: 11 links, including GOV.UK, Japan Ministry of Foreign Affairs, the Japanese Consulate in Toronto, Travel.gc.ca, and the U.S. State Department.
Timing: Web search 3.59s, LLM processing 37.20s, total 40.80s.
This single query triggered parallel searches across Nigerian passport requirements, British passport requirements, visa waiver programs for both, Canadian work permit travel benefits, re-entry requirements, and current Japan policy updates. The system synthesized all of that into a structured, actionable response with passport-specific comparisons, which is the kind of nuanced analysis that is difficult to get from any single government website.
Lessons Learned
Prompt engineering is iteration, not inspiration. The system prompt for the Analyzer went through many revisions. Early versions produced verbose, unstructured text. The breakthrough was defining an exact JSON schema in the prompt and being extremely specific about what each field should contain. Pydantic validation on the backend creates a tight feedback loop: you know immediately when the model’s output drifts from the expected structure.
The reflection pattern is powerful. For the additional processing time and cost, it delivers the most accurate and most confident information. Adding a second LLM pass to review the first model’s output caught a surprising number of edge cases, especially around re-entry requirements and the interplay between dual citizenship and residency status. The pattern is simple to implement (it’s just a second API call with the first response as context) and the quality improvement is meaningful for complex scenarios.
Async-first is worth the investment. The service makes many I/O-bound calls: web searches, LLM API, database. FastAPI’s async support combined with asyncio.gather(), the async Anthropic SDK, and async Supabase client means all of these run concurrently. This was one of the biggest performance wins in the project, cutting total request time significantly.
Caching should be a day-one decision. Implementing caching early paid dividends. It reduced API costs dramatically, improved response times for repeat queries, and the cache key design (sorted, normalized, versioned) avoided the typical cache invalidation headaches.
Evaluation is non-negotiable. Without the eval framework, every prompt change was guesswork. With it, changes became measurable. The LLM-as-a-judge approach isn’t perfect, but it is far better than manual spot-checking and scales to any number of test cases.
Current Bottleneck
The main bottleneck in the system is LLM inference speed. Even with parallel web searches and caching optimizations in place, the LLM analysis step dominates total latency on cache misses. Users wait on sequential LLM calls: parsing the input, running the main analysis, and optionally running the reflection pass. Each of these calls depends on the previous one’s output, so they cannot be parallelized.
For a standard analysis, this means two sequential LLM calls (parsing + analysis). For an enhanced analysis with reflection, it’s three. As models get faster and providers continue improving inference infrastructure, this bottleneck will naturally ease, but for now it remains the primary constraint on response time.
One path forward is exploring alternative models that can serve the same purpose at lower latency. Faster models from other providers, or smaller fine-tuned models, could reduce per-call time significantly. However, this requires careful benchmarking across the full evaluation suite to ensure accuracy and completeness do not degrade. Speed means nothing if the quality of the visa determination suffers. Any model swap would need to pass the same weighted metrics (accuracy, completeness, clarity, consistency) at the same thresholds before going to production.
What I’d Do Differently
Structured outputs from the start. Structured outputs are a feature where the LLM is constrained to return responses conforming to a specific JSON schema, guaranteeing valid, parseable output every time. Instead of hoping the model follows your formatting instructions and then writing validation logic to catch when it doesn’t, structured outputs enforce the schema at generation time. Claude now supports this capability, and using it from the beginning would have replaced much of the manual JSON parsing and Pydantic validation logic in the backend, simplifying the code and eliminating schema drift issues.
Streaming from day one. I added SSE streaming later in the project. Designing for it from the beginning would have been cleaner and avoided some refactoring.
Better error granularity. When a search fails for one passport but succeeds for another, the current error handling could surface this more clearly to the user rather than treating it as a blanket failure.
Try It Out
Visa Advisor is live at getvisaadvisor.com. While the source code is not publicly available, you can use the service directly to see it in action. If you’re building something similar, especially an LLM-powered service that deals with factual, high-stakes information, I hope this walkthrough saves you some of the trial and error I went through. Invest early in web search grounding, structured output validation, caching, and evaluation. These aren’t nice-to-haves; they’re what separate a demo from a product.
I’d love to hear your feedback. Feel free to try the service and reach out directly.