From 55f095ec76a4d718fb18282119b72a7ff75258f6 Mon Sep 17 00:00:00 2001
From: Rune Olsen
Date: Tue, 27 Jan 2026 09:46:02 +0100
Subject: [PATCH] Multilingual

---
 MULTILINGUAL.md   | 290 ++++++++++++++++++++++++++++++++++++++++++++++
 src/ai/prompts.py |  12 +-
 2 files changed, 299 insertions(+), 3 deletions(-)
 create mode 100644 MULTILINGUAL.md

diff --git a/MULTILINGUAL.md b/MULTILINGUAL.md
new file mode 100644
index 0000000..794793a
--- /dev/null
+++ b/MULTILINGUAL.md
@@ -0,0 +1,290 @@
+# Multilingual Support
+
+News Agent supports articles in multiple languages, including Norwegian, English, and others.
+
+## Configuration
+
+### 1. Add Norwegian RSS Sources
+
+In `config.yaml`:
+
+```yaml
+sources:
+  rss:
+    - name: "NRK Nyheter"
+      url: "https://www.nrk.no/toppsaker.rss"
+      category: "tech"
+
+    - name: "Digi.no"
+      url: "https://www.digi.no/rss"
+      category: "tech"
+
+    - name: "Kode24"
+      url: "https://www.kode24.no/rss"
+      category: "development"
+```
+
+### 2. Add Multilingual Interests
+
+In `config.yaml` under `ai.interests`:
+
+```yaml
+ai:
+  interests:
+    - "Technology news from Norway (Norwegian articles welcome)"
+    - "Norwegian tech industry and startups"
+    - "General news from Norway in Norwegian"
+    - "AI and machine learning developments"
+    - "Self-hosting solutions"
+    # ... other interests
+```
+
+**Key points:**
+- Be explicit: "Norwegian articles welcome" or "in Norwegian"
+- Mention the country/region for local news
+- Use natural language
+
+### 3. Language Handling
+
+The AI prompts now explicitly support multiple languages:
+
+**Filtering:**
+- Articles evaluated regardless of language
+- Norwegian content given equal consideration
+- Interest matching works across languages
+
+**Summarization:**
+- Summaries written in the SAME language as the source
+- Norwegian articles → Norwegian summaries
+- English articles → English summaries
+
+## Tips for Better Norwegian Content
+
+### 1. Be Specific with Interests
+
+❌ **Too vague:**
+```yaml
+- "News from Norway"
+```
+
+✅ **Better:**
+```yaml
+- "Norwegian technology news (accept Norwegian language)"
+- "Politikk og samfunn fra Norge"
+- "Norsk tech-industri og oppstartselskaper"
+```
+
+### 2. Lower Filtering Threshold
+
+Norwegian content might score slightly lower initially. Try:
+
+```yaml
+ai:
+  filtering:
+    min_score: 5.0  # Lower threshold (was 5.5)
+```
+
+### 3. Use Norwegian News Sources
+
+**Tech/Development:**
+- Digi.no: `https://www.digi.no/rss`
+- Kode24: `https://www.kode24.no/rss`
+- Tek.no: `https://www.tek.no/rss`
+
+**General News:**
+- NRK: `https://www.nrk.no/toppsaker.rss`
+- VG: `https://www.vg.no/rss/feed/`
+- Aftenposten: `https://www.aftenposten.no/rss`
+
+**Business/Tech:**
+- DN (Dagens Næringsliv): Check their RSS feeds
+- E24: `https://e24.no/rss`
+
+### 4. Mixed Language Email
+
+Your email will contain both English and Norwegian articles:
+
+```
+TECH
+----
+• Norwegian startup raises $10M (English summary)
+• Norsk AI-selskap lanserer ny tjeneste (Norwegian summary)
+
+DEVELOPMENT
+-----------
+• New Python framework released (English summary)
+• Kode24: Sånn bruker du GitHub Copilot (Norwegian summary)
+```
+
+## Troubleshooting
+
+### Not Getting Norwegian Articles?
+
+**1. Check if articles are being fetched:**
+```bash
+# Enable debug logging
+# In config.yaml:
+#   logging:
+#     level: "DEBUG"
+
+# Run and check
+python -m src.main
+grep -i "norwegian\|norge" data/logs/news-agent.log
+```
+
+**2. Check filtering scores:**
+```bash
+sqlite3 data/articles.db "SELECT title, relevance_score, source FROM articles WHERE source LIKE '%nrk%' OR source LIKE '%digi%' OR source LIKE '%kode24%' ORDER BY fetched_at DESC LIMIT 10;"
+```
+
+**3. Verify RSS feed works:**
+```bash
+curl -s "https://www.digi.no/rss" | head -50
+```
+
+**4. Manually check if feed has recent content:**
+Visit the RSS URL in your browser.
+
+### Norwegian Articles Scoring Low?
+
+**Possible reasons:**
+
+1. **Interest not specific enough**
+   - Add: "Norwegian technology and business news in Norwegian language"
+
+2. **Threshold too high**
+   - Lower to 4.5 or 5.0
+
+3. **Content too general**
+   - Norwegian general news might not match "tech" interests
+   - Add specific Norwegian interests
+
+4. **Article content is short**
+   - Some RSS feeds only include headlines
+   - AI can't judge relevance from title alone
+
+### Mixed Results?
+
+**If you're getting English but not Norwegian:**
+
+1. **Check the interest phrasing:**
+   ```yaml
+   # Add to top of interests list:
+   interests:
+     - "Norwegian news and technology (Norwegian language accepted)"
+     - "Norge: teknologi, samfunn, og næringsliv"
+   ```
+
+2. **Use a more permissive model:**
+   ```yaml
+   ai:
+     model: "anthropic/claude-3.5-haiku"  # Better with multiple languages
+   ```
+
+3. **Test with debug mode:**
+   ```bash
+   # Enable debug logging and run
+   python -m src.main 2>&1 | grep -A 3 -B 3 "Norwegian\|Norge"
+   ```
+
+## Example Configuration
+
+Complete example supporting both English and Norwegian:
+
+```yaml
+sources:
+  rss:
+    # English sources
+    - name: "Hacker News"
+      url: "https://news.ycombinator.com/rss"
+      category: "tech"
+
+    - name: "TechCrunch"
+      url: "https://techcrunch.com/feed/"
+      category: "tech"
+
+    # Norwegian sources
+    - name: "Digi.no"
+      url: "https://www.digi.no/rss"
+      category: "tech"
+
+    - name: "Kode24"
+      url: "https://www.kode24.no/rss"
+      category: "development"
+
+    - name: "NRK Nyheter"
+      url: "https://www.nrk.no/toppsaker.rss"
+      category: "tech"
+
+ai:
+  model: "openai/gpt-4o-mini"
+
+  filtering:
+    enabled: true
+    min_score: 5.0  # Slightly lower for Norwegian content
+    max_articles: 20  # More articles to ensure Norwegian included
+
+  interests:
+    # Norwegian-specific
+    - "Norwegian technology news and developments (Norwegian language)"
+    - "Norsk tech-industri, oppstartselskaper, og innovasjon"
+    - "General news from Norway or about Norway"
+
+    # General (works for both languages)
+    - "AI and machine learning developments"
+    - "Open source projects and tools"
+    - "Self-hosting solutions"
+    - "Python and software development"
+```
+
+## Language Statistics
+
+After running, check language distribution:
+
+```bash
+sqlite3 data/articles.db "
+SELECT
+  CASE
+    WHEN source LIKE '%nrk%' OR source LIKE '%digi%' OR source LIKE '%kode%' THEN 'Norwegian'
+    ELSE 'English'
+  END as language,
+  COUNT(*) as count,
+  AVG(relevance_score) as avg_score
+FROM articles
+WHERE processed = 1
+GROUP BY language;
+"
+```
+
+## Best Practices
+
+1. **Be explicit** - Tell AI that Norwegian is welcome
+2. **Lower threshold** - 5.0 instead of 5.5 or 6.5
+3. **More articles** - Increase `max_articles` to 20
+4. **Specific interests** - Mention Norwegian topics explicitly
+5. **Good sources** - Use active Norwegian tech/news RSS feeds
+6. **Test first** - Run manually with debug logging
+
+## Models and Multilingual Support
+
+All modern models support multiple languages well:
+
+| Model | Norwegian Support | Recommendation |
+|-------|------------------|----------------|
+| `openai/gpt-4o-mini` | Excellent | ✅ Recommended |
+| `anthropic/claude-3.5-haiku` | Excellent | ✅ Best for multilingual |
+| `google/gemini-2.0-flash-exp:free` | Good | ⚠️ Has rate limits |
+
+Claude models are particularly good with Scandinavian languages.
+
+## Summary
+
+To get Norwegian articles in your digest:
+
+1. ✅ Add Norwegian RSS sources
+2. ✅ Add explicit Norwegian interests ("Norwegian language accepted")
+3. ✅ Lower filtering threshold to 5.0
+4. ✅ Updated prompts (already done!)
+5. ✅ Test with: `python -m src.main`
+
+The summaries will automatically be in the same language as the source article!
diff --git a/src/ai/prompts.py b/src/ai/prompts.py
index f9ecbda..4a79791 100644
--- a/src/ai/prompts.py
+++ b/src/ai/prompts.py
@@ -1,6 +1,8 @@
 """Prompt templates for AI processing"""
 
-FILTERING_SYSTEM_PROMPT = """You are a news relevance analyzer. Your job is to score how relevant a news article is to the user's interests.
+FILTERING_SYSTEM_PROMPT = """You are a multilingual news relevance analyzer. Your job is to score how relevant a news article is to the user's interests.
+
+IMPORTANT: Articles can be in ANY language (English, Norwegian, etc.). Evaluate content regardless of language.
 
 User Interests:
 {interests}
@@ -13,7 +15,8 @@ Score the article on a scale of 0-10 based on:
 Return ONLY a JSON object with this exact format:
 {{"score": , "reason": ""}}
 
-Be strict - only highly relevant articles should score above 7.0."""
+Be strict - only highly relevant articles should score above 7.0.
+Give Norwegian articles the SAME consideration as English articles."""
 
 FILTERING_USER_PROMPT = """Article Title: {title}
 
@@ -26,7 +29,9 @@ Content Preview: {content}
 
 Score this article's relevance (0-10) and explain why."""
 
-SUMMARIZATION_SYSTEM_PROMPT = """You are a technical news summarizer. Create concise, informative summaries of tech articles.
+SUMMARIZATION_SYSTEM_PROMPT = """You are a multilingual technical news summarizer. Create concise, informative summaries of articles.
+
+IMPORTANT: If the article is in Norwegian, write the summary in Norwegian. If in English, write in English. Match the source language.
 
 Guidelines:
 - Focus on key facts, findings, and implications
@@ -34,6 +39,7 @@ Guidelines:
 - Keep summaries to 2-3 sentences
 - Use clear, professional language
 - Highlight what makes this newsworthy
+- Preserve the original article's language
 
 Return ONLY the summary text, no additional formatting."""
 
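
Reviewer note: the templates in `src/ai/prompts.py` appear to be `str.format`-style strings, since `{interests}` is a placeholder and the JSON example uses doubled braces. The patch does not show the call site, so the following is only a minimal sketch of how the filtering prompt might be filled and how a reply could be checked against `min_score`. The trimmed template, the placeholder text inside its JSON example, and the helpers `build_system_prompt` and `passes_filter` are hypothetical and not part of this patch.

```python
import json

# Trimmed stand-in for FILTERING_SYSTEM_PROMPT (assumption: not the full
# template; the <number>/<brief explanation> placeholders are illustrative).
FILTERING_SYSTEM_PROMPT = """You are a multilingual news relevance analyzer.

User Interests:
{interests}

Return ONLY a JSON object with this exact format:
{{"score": <number>, "reason": "<brief explanation>"}}"""


def build_system_prompt(interests):
    """Fill {interests}; '{{' and '}}' come out as literal braces."""
    bullet_list = "\n".join(f"- {item}" for item in interests)
    return FILTERING_SYSTEM_PROMPT.format(interests=bullet_list)


def passes_filter(model_reply, min_score=5.0):
    """Parse the model's JSON reply and compare its score with min_score."""
    try:
        score = float(json.loads(model_reply)["score"])
    except (ValueError, KeyError, TypeError):
        return False  # hypothetical policy: malformed replies count as not relevant
    return score >= min_score


if __name__ == "__main__":
    print(build_system_prompt(["Norsk tech-industri og oppstartselskaper"]))
    print(passes_filter('{"score": 5.2, "reason": "Norsk teknologinyhet"}'))
```

Rejecting malformed replies is one possible choice; the real pipeline may retry or log instead.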