Multilingual
MULTILINGUAL.md
@@ -0,0 +1,290 @@
# Multilingual Support

News Agent supports articles in multiple languages, including Norwegian, English, and others.

## Configuration

### 1. Add Norwegian RSS Sources

In `config.yaml`:

```yaml
sources:
  rss:
    - name: "NRK Nyheter"
      url: "https://www.nrk.no/toppsaker.rss"
      category: "tech"

    - name: "Digi.no"
      url: "https://www.digi.no/rss"
      category: "tech"

    - name: "Kode24"
      url: "https://www.kode24.no/rss"
      category: "development"
```

### 2. Add Multilingual Interests

In `config.yaml` under `ai.interests`:

```yaml
ai:
  interests:
    - "Technology news from Norway (Norwegian articles welcome)"
    - "Norwegian tech industry and startups"
    - "General news from Norway in Norwegian"
    - "AI and machine learning developments"
    - "Self-hosting solutions"
    # ... other interests
```

**Key points:**
- Be explicit: "Norwegian articles welcome" or "in Norwegian"
- Mention the country/region for local news
- Use natural language

### 3. Language Handling

The AI prompts now explicitly support multiple languages:

**Filtering:**
- Articles are evaluated regardless of language
- Norwegian content is given equal consideration
- Interest matching works across languages

**Summarization:**
- Summaries are written in the SAME language as the source
- Norwegian articles → Norwegian summaries
- English articles → English summaries
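
As a rough illustration of how the filtering prompt might be filled in and its reply read back, here is a minimal sketch. `SYSTEM_TEMPLATE` is an abbreviated stand-in for the project's filtering prompt, and `build_system_prompt` and `parse_score` are hypothetical helper names; rendering the interests as a bullet list is an assumption, not something this guide confirms.

```python
import json

# Abbreviated stand-in for the filtering system prompt (assumption).
# Doubled braces survive str.format() as literal JSON braces.
SYSTEM_TEMPLATE = (
    "You are a multilingual news relevance analyzer.\n\n"
    "User Interests:\n{interests}\n\n"
    "Return ONLY a JSON object with this exact format:\n"
    '{{"score": <float>, "reason": "<brief explanation>"}}'
)


def build_system_prompt(interests: list[str]) -> str:
    """Render the interests as a bullet list inside the template."""
    bullets = "\n".join(f"- {item}" for item in interests)
    return SYSTEM_TEMPLATE.format(interests=bullets)


def parse_score(reply: str) -> tuple[float, str]:
    """Parse the model's JSON reply; fall back to 0.0 if it is malformed."""
    try:
        data = json.loads(reply)
        return float(data["score"]), str(data.get("reason", ""))
    except (ValueError, KeyError, TypeError):
        return 0.0, "unparseable reply"
```

Because the interests are passed through verbatim, Norwegian phrases such as "Norsk tech-industri" reach the model unchanged, which is what lets interest matching work across languages.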

## Tips for Better Norwegian Content

### 1. Be Specific with Interests

❌ **Too vague:**
```yaml
- "News from Norway"
```

✅ **Better:**
```yaml
- "Norwegian technology news (accept Norwegian language)"
- "Politikk og samfunn fra Norge"
- "Norsk tech-industri og oppstartselskaper"
```

### 2. Lower Filtering Threshold

Norwegian content might score slightly lower initially. Try:

```yaml
ai:
  filtering:
    min_score: 5.0  # Lower threshold (was 5.5)
```
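
To see why a small change in `min_score` matters, the sketch below models one plausible selection rule: keep articles at or above the threshold, best first, capped at `max_articles`. `ScoredArticle` and `select_articles` are hypothetical names, and the rule itself is an assumption about how the filter applies these settings.

```python
from dataclasses import dataclass


@dataclass
class ScoredArticle:  # hypothetical shape, not the project's own model
    title: str
    score: float


def select_articles(articles, min_score=5.0, max_articles=20):
    """Keep articles scoring at least min_score, best first, capped at max_articles."""
    kept = [a for a in articles if a.score >= min_score]
    kept.sort(key=lambda a: a.score, reverse=True)
    return kept[:max_articles]
```

Under this rule, an article scored 5.2 is dropped at a threshold of 5.5 but kept at 5.0, which is the kind of borderline case the lower threshold is meant to catch.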

### 3. Use Norwegian News Sources

**Tech/Development:**
- Digi.no: `https://www.digi.no/rss`
- Kode24: `https://www.kode24.no/rss`
- Tek.no: `https://www.tek.no/rss`

**General News:**
- NRK: `https://www.nrk.no/toppsaker.rss`
- VG: `https://www.vg.no/rss/feed/`
- Aftenposten: `https://www.aftenposten.no/rss`

**Business/Tech:**
- DN (Dagens Næringsliv): Check their RSS feeds
- E24: `https://e24.no/rss`

### 4. Mixed Language Email

Your email will contain both English and Norwegian articles:

```
TECH
----
• Norwegian startup raises $10M (English summary)
• Norsk AI-selskap lanserer ny tjeneste (Norwegian summary)

DEVELOPMENT
-----------
• New Python framework released (English summary)
• Kode24: Sånn bruker du GitHub Copilot (Norwegian summary)
```
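
The layout above can be approximated with a few lines of grouping code. This is only a sketch: `render_digest` is a hypothetical helper, and the real email template may be built differently.

```python
from collections import defaultdict


def render_digest(articles):
    """Group (category, title) pairs under upper-cased, underlined headings."""
    by_category = defaultdict(list)
    for category, title in articles:
        by_category[category].append(title)

    sections = []
    for category, titles in by_category.items():
        heading = category.upper()
        lines = [heading, "-" * len(heading)] + [f"• {t}" for t in titles]
        sections.append("\n".join(lines))
    return "\n\n".join(sections)
```

Articles are grouped by category only, so English and Norwegian items naturally sit side by side under the same heading.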

## Troubleshooting

### Not Getting Norwegian Articles?

**1. Check if articles are being fetched:**
```bash
# Enable debug logging first. In config.yaml:
#   logging:
#     level: "DEBUG"

# Run and check
python -m src.main
grep -i "norwegian\|norge" data/logs/news-agent.log
```

**2. Check filtering scores:**
```bash
# Match the Norwegian sources by name; adjust the patterns to your own sources
sqlite3 data/articles.db "SELECT title, relevance_score, source FROM articles WHERE source LIKE '%nrk%' OR source LIKE '%digi%' OR source LIKE '%kode%' ORDER BY fetched_at DESC LIMIT 10;"
```

**3. Verify RSS feed works:**
```bash
curl -s "https://www.digi.no/rss" | head -50
```

**4. Manually check if feed has recent content:**
Visit the RSS URL in your browser.
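
The manual feed checks above can also be scripted. The following is a minimal sketch using only the Python standard library; `summarize_feed` is a hypothetical helper, and it assumes a plain RSS 2.0 layout (`channel/item` with `pubDate` and `description`), so Atom feeds are not handled.

```python
import xml.etree.ElementTree as ET
from datetime import timezone
from email.utils import parsedate_to_datetime


def summarize_feed(xml_text):
    """Count RSS 2.0 items, find the newest pubDate, and count items with no description."""
    root = ET.fromstring(xml_text)
    items = root.findall("./channel/item")

    dates = []
    for item in items:
        pub = item.findtext("pubDate")
        if not pub:
            continue
        try:
            parsed = parsedate_to_datetime(pub)
        except (TypeError, ValueError):
            continue  # skip dates that are not RFC 2822 formatted
        if parsed.tzinfo is None:
            # Treat naive timestamps as UTC so all dates stay comparable
            parsed = parsed.replace(tzinfo=timezone.utc)
        dates.append(parsed)

    headline_only = sum(
        1 for item in items if not (item.findtext("description") or "").strip()
    )
    return {
        "items": len(items),
        "newest": max(dates) if dates else None,
        "headline_only": headline_only,
    }
```

The bytes returned by `urllib.request.urlopen(url).read()` can be passed straight to `summarize_feed`. A high `headline_only` count suggests the feed carries little beyond titles, which leaves the model with very little text to score.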

### Norwegian Articles Scoring Low?

**Possible reasons:**

1. **Interest not specific enough**
   - Add: "Norwegian technology and business news in Norwegian language"

2. **Threshold too high**
   - Lower to 4.5 or 5.0

3. **Content too general**
   - Norwegian general news might not match "tech" interests
   - Add specific Norwegian interests

4. **Article content is short**
   - Some RSS feeds only include headlines
   - AI can't judge relevance from title alone

### Mixed Results?

**If you're getting English but not Norwegian:**

1. **Check the interest phrasing:**
   ```yaml
   # Add to top of interests list:
   interests:
     - "Norwegian news and technology (Norwegian language accepted)"
     - "Norge: teknologi, samfunn, og næringsliv"
   ```

2. **Use a more permissive model:**
   ```yaml
   ai:
     model: "anthropic/claude-3.5-haiku"  # Better with multiple languages
   ```

3. **Test with debug mode:**
   ```bash
   # Enable debug logging and run
   python -m src.main 2>&1 | grep -A 3 -B 3 "Norwegian\|Norge"
   ```

## Example Configuration

Complete example supporting both English and Norwegian:

```yaml
sources:
  rss:
    # English sources
    - name: "Hacker News"
      url: "https://news.ycombinator.com/rss"
      category: "tech"

    - name: "TechCrunch"
      url: "https://techcrunch.com/feed/"
      category: "tech"

    # Norwegian sources
    - name: "Digi.no"
      url: "https://www.digi.no/rss"
      category: "tech"

    - name: "Kode24"
      url: "https://www.kode24.no/rss"
      category: "development"

    - name: "NRK Nyheter"
      url: "https://www.nrk.no/toppsaker.rss"
      category: "tech"

ai:
  model: "openai/gpt-4o-mini"

  filtering:
    enabled: true
    min_score: 5.0  # Slightly lower for Norwegian content
    max_articles: 20  # More articles to ensure Norwegian included

  interests:
    # Norwegian-specific
    - "Norwegian technology news and developments (Norwegian language)"
    - "Norsk tech-industri, oppstartselskaper, og innovasjon"
    - "General news from Norway or about Norway"

    # General (works for both languages)
    - "AI and machine learning developments"
    - "Open source projects and tools"
    - "Self-hosting solutions"
    - "Python and software development"
```

## Language Statistics

After running, check the language distribution. The query infers language from the source name, so adjust the patterns to match your own Norwegian sources:

```bash
sqlite3 data/articles.db "
SELECT
  CASE
    WHEN source LIKE '%nrk%' OR source LIKE '%digi%' OR source LIKE '%kode%' THEN 'Norwegian'
    ELSE 'English'
  END as language,
  COUNT(*) as count,
  AVG(relevance_score) as avg_score
FROM articles
WHERE processed = 1
GROUP BY language;
"
```
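
The same statistic can be computed from Python with the standard `sqlite3` module. This is a sketch under assumptions: `language_stats` is a hypothetical helper, the column names are taken from the query above, and it assumes `source` stores the exact source name from `config.yaml`.

```python
import sqlite3


def language_stats(conn, norwegian_sources):
    """Return (language, count, avg_score) rows, classifying articles by source name."""
    placeholders = ",".join("?" for _ in norwegian_sources)
    sql = f"""
        SELECT
          CASE WHEN source IN ({placeholders}) THEN 'Norwegian' ELSE 'English' END AS language,
          COUNT(*) AS count,
          AVG(relevance_score) AS avg_score
        FROM articles
        WHERE processed = 1
        GROUP BY language
        ORDER BY language
    """
    # Source names are bound as parameters rather than pasted into the SQL text
    return conn.execute(sql, list(norwegian_sources)).fetchall()
```

A call might look like `language_stats(sqlite3.connect("data/articles.db"), ["NRK Nyheter", "Digi.no", "Kode24"])`. Passing the list explicitly avoids the pattern-matching guesswork of `LIKE`, at the cost of keeping the list in sync with the configured sources.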

## Best Practices

1. **Be explicit** - Tell AI that Norwegian is welcome
2. **Lower threshold** - 5.0 instead of 5.5 or 6.5
3. **More articles** - Increase `max_articles` to 20
4. **Specific interests** - Mention Norwegian topics explicitly
5. **Good sources** - Use active Norwegian tech/news RSS feeds
6. **Test first** - Run manually with debug logging

## Models and Multilingual Support

All modern models support multiple languages well:

| Model | Norwegian Support | Recommendation |
|-------|------------------|----------------|
| `openai/gpt-4o-mini` | Excellent | ✅ Recommended |
| `anthropic/claude-3.5-haiku` | Excellent | ✅ Best for multilingual |
| `google/gemini-2.0-flash-exp:free` | Good | ⚠️ Has rate limits |

Claude models are particularly good with Scandinavian languages.

## Summary

To get Norwegian articles in your digest:

1. ✅ Add Norwegian RSS sources
2. ✅ Add explicit Norwegian interests ("Norwegian language accepted")
3. ✅ Lower filtering threshold to 5.0
4. ✅ Updated prompts (already done!)
5. ✅ Test with: `python -m src.main`

The summaries will automatically be in the same language as the source article!
@@ -1,6 +1,8 @@
 """Prompt templates for AI processing"""
 
-FILTERING_SYSTEM_PROMPT = """You are a news relevance analyzer. Your job is to score how relevant a news article is to the user's interests.
+FILTERING_SYSTEM_PROMPT = """You are a multilingual news relevance analyzer. Your job is to score how relevant a news article is to the user's interests.
 
+IMPORTANT: Articles can be in ANY language (English, Norwegian, etc.). Evaluate content regardless of language.
+
 User Interests:
 {interests}
@@ -13,7 +15,8 @@ Score the article on a scale of 0-10 based on:
 Return ONLY a JSON object with this exact format:
 {{"score": <float>, "reason": "<brief explanation>"}}
 
-Be strict - only highly relevant articles should score above 7.0."""
+Be strict - only highly relevant articles should score above 7.0.
+Give Norwegian articles the SAME consideration as English articles."""
 
 FILTERING_USER_PROMPT = """Article Title: {title}
 
@@ -26,7 +29,9 @@ Content Preview: {content}
 Score this article's relevance (0-10) and explain why."""
 
 
-SUMMARIZATION_SYSTEM_PROMPT = """You are a technical news summarizer. Create concise, informative summaries of tech articles.
+SUMMARIZATION_SYSTEM_PROMPT = """You are a multilingual technical news summarizer. Create concise, informative summaries of articles.
 
+IMPORTANT: If the article is in Norwegian, write the summary in Norwegian. If in English, write in English. Match the source language.
+
 Guidelines:
 - Focus on key facts, findings, and implications
@@ -34,6 +39,7 @@ Guidelines:
 - Keep summaries to 2-3 sentences
 - Use clear, professional language
 - Highlight what makes this newsworthy
+- Preserve the original article's language
 
 Return ONLY the summary text, no additional formatting."""
 