Edu_Omni_MyMind

EduMIND — Multimodal Bilingual Lecture Assistant & Active Learning Pipeline

Python 3.10 | uv | Streamlit | Qdrant | Label Studio | License: MIT

EduMIND is a modular bilingual lecture assistant and active learning pipeline. It is designed for academic environments where lectures mix languages (e.g., code-mixed Vietnamese-English, such as “hôm nay chúng ta học attention mechanism”, i.e. "today we learn the attention mechanism"). EduMIND transcribes bilingual speech, measures code-switching metrics, translates text while preserving technical terms, indexes slides, and answers questions through retrieval-augmented generation (RAG).

The system integrates a Human-in-the-Loop Active Learning framework powered by Label Studio and an ML backend to continually harvest human-corrected data, immediately updating the local knowledge base and building a gold-standard corpus.


🎙️ Core Components & Architecture

                  +----------------------------------+
                  |    Bilingual Audio Lecture       |
                  +-----------------+----------------+
                                    |
                                    v
                       [ 🎙️ Bilingual Note-Taker ]
                         Whisper ASR + Post-RegEx
                                    |
                                    v
                     [ 🔄 VietMix Translation & CMI ]
                        Dict / Seq2Seq Translation
                                    |
                                    v
                      [ 📚 Anti-Forget RAG Engine ]
                         PDF Chunking -> Qdrant
                                    |
        +---------------------------+---------------------------+
        |                                                       |
        v (Retrieval QA)                                        v (Active Learning)
 [ Streamlit Assistant ]                              [ Label Studio UI (Port 8080) ]
   RAG Chat + Analytics                                  TA/Human Review & Correction
                                                                |
                                                                v
                                                       [ edumind_ml_backend ]
                                                         - Writes to corpus.jsonl
                                                         - Re-indexes to Qdrant Vector DB

1. Bilingual Note-Taker (Speech ASR)
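The architecture diagram lists a regex post-processing pass after Whisper ASR, but this README does not show the actual rules. The following is only a minimal sketch of what such a pass could look like: the `TERM_FIXES` table and the `postprocess` function are hypothetical names, and the two patterns are invented examples, not the project's real correction list.

```python
import re

# Hypothetical correction table (pattern -> canonical technical term).
# These entries are illustrative assumptions, not EduMIND's actual rules.
TERM_FIXES = [
    (re.compile(r"\bat\s*ten\s*tion\b", re.IGNORECASE), "attention"),
    (re.compile(r"\bgradient\s+de[sc]+ent\b", re.IGNORECASE), "gradient descent"),
]


def postprocess(text: str) -> str:
    """Normalize known technical terms, then collapse repeated whitespace."""
    for pattern, canonical in TERM_FIXES:
        text = pattern.sub(canonical, text)
    return re.sub(r"\s+", " ", text).strip()


raw = "hôm nay chúng ta học at ten tion  mechanism"
cleaned = postprocess(raw)
print(cleaned)
```

Keeping the rules in a table rather than in code makes it straightforward to extend the list as reviewers find recurring ASR mistakes.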

2. VietMix Machine Translation
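The diagram pairs this stage with "CMI". Assuming that stands for the Code-Mixing Index of Gambäck and Das, the metric for one utterance is 100 * (1 - max_lang / (n - u)), where n is the token count, u the number of language-neutral tokens, and max_lang the token count of the dominant language. The sketch below takes per-token language tags as given; the function name and the tag labels (`vi`, `en`, `other`) are assumptions, and how EduMIND actually tags tokens is not described here.

```python
from collections import Counter


def code_mixing_index(tags, neutral="other"):
    """Code-Mixing Index for one utterance, given per-token language tags.

    Tokens tagged `neutral` (numbers, punctuation, names) are excluded
    from the denominator. Returns 0.0 for monolingual or empty input.
    """
    counts = Counter(tag for tag in tags if tag != neutral)
    lang_tokens = sum(counts.values())  # n - u in the formula
    if lang_tokens == 0:
        return 0.0
    return 100.0 * (1.0 - max(counts.values()) / lang_tokens)


# "hôm nay chúng ta học attention mechanism": 5 Vietnamese + 2 English tokens
tags = ["vi", "vi", "vi", "vi", "vi", "en", "en"]
cmi = code_mixing_index(tags)
print(round(cmi, 2))
```

A value of 0 means the utterance is monolingual; the value grows as the non-dominant languages take a larger share of the tokens.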

3. Anti-Forget RAG Engine
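The diagram describes this stage as "PDF Chunking -> Qdrant". As a rough illustration of the chunking step only, the sketch below splits already-extracted slide text into overlapping word windows. The function name, window size, and overlap are assumptions chosen for the example; PDF text extraction, embedding, and the Qdrant upsert are deliberately left out.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows of `size` words sharing `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


# Tiny example: 10 words, windows of 4 words, 1 word of overlap
sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_text(sample, size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighbouring chunk, at the cost of some duplicated storage.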

4. Active Learning Loop (Label Studio Hook)
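According to the diagram, the ML backend writes human corrections to corpus.jsonl before re-indexing them. The sketch below shows one plausible way to append such a record as a JSON line. The function name and the record fields (`task_id`, `original`, `corrected`) are hypothetical; the real schema used by edumind_ml_backend is not documented in this README, and the Qdrant re-indexing step is omitted.

```python
import json
import tempfile
from pathlib import Path


def append_correction(corpus_path, task_id, original, corrected):
    """Append one human-corrected record to a JSONL corpus file."""
    # Hypothetical record layout, for illustration only.
    record = {"task_id": task_id, "original": original, "corrected": corrected}
    path = Path(corpus_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        # ensure_ascii=False keeps Vietnamese text readable in the file
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


# Demonstration against a temporary directory instead of data/processed/
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "processed" / "corpus.jsonl"
    append_correction(corpus, 1, "học atention", "học attention")
    append_correction(corpus, 2, "mô hình transfomer", "mô hình transformer")
    lines = corpus.read_text(encoding="utf-8").splitlines()

print(len(lines))
```

Appending one JSON object per line means a partially written file stays readable up to the last complete record.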


📂 Project Organization

├── LICENSE                           <- MIT License
├── README.md                         <- This main system guide
├── CONTRIBUTING.md                   <- Development, CI/CD, and style guidelines
├── Makefile                          <- Task automation commands
├── pyproject.toml                    <- Project specs & package dependencies
├── uv.lock                           <- Lockfile for exact package reproducibility
├── docker-compose.yml                <- Docker Compose configuration for the Label Studio stack
├── Dockerfile.label-studio           <- Multi-stage Docker build for the ML backend
│
├── configs/
│   └── default_config.yaml           <- Hyperparameter configurations
│
├── data/
│   ├── raw/
│   │   ├── audio_chunks/             <- Raw lecture wav chunks
│   │   └── pdf_slides/               <- PDF lecture materials
│   └── processed/
│       └── corpus.jsonl              <- Target gold-standard active learning corpus
│
├── edumind/                          <- Core Python source package
│   ├── app.py                        <- Streamlit frontend implementation
│   ├── config/                       <- Pydantic validation definitions
│   ├── core/                         <- Logger, Dependency Injection container, Exceptions
│   ├── models/                       <- Data models & schemas (ASR, Translation, RAG)
│   ├── modules/                      <- Core engines (RAG, Speech ASR, VietMix Translator)
│   ├── services/                     <- Strategy implementations (Embedding, LLM, Translation)
│   └── utils/                        <- String utilities, file helpers, model registries
│
├── label_studio_backend/             <- Flask active learning ML Backend
│   ├── _wsgi.py                      <- WSGI entry point for container execution
│   ├── model.py                      <- Label Studio ML backend subclass code
│   └── setup_env.sh                  <- Shell bootstrapper for local host testing
│
└── tests/                            <- Complete unit & integration test suite

🛠️ Installation & Environment Setup

This project uses uv to manage the Python virtual environment and dependencies. Ensure it is installed on your machine.

  1. Clone the repository:
    git clone <repo-url>
    cd edumind
    
  2. Synchronize environment and install dependencies:
    make requirements
    

    This automatically builds a virtual environment under .venv/ and installs the package in editable mode.

  3. Configure Environment Variables: Copy the template file to .env and fill in your values (like LLM API keys):
    cp .env.example .env
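
    Before starting the app, it can help to confirm that no required value in .env was left blank. The sketch below is a small standard-library check, not part of EduMIND itself; the key names `LLM_API_KEY` and `QDRANT_URL` are hypothetical placeholders, so substitute the keys that actually appear in .env.example.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str], required: list[str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


# Hypothetical .env content; the real keys come from .env.example
sample = """# example values only
LLM_API_KEY=
QDRANT_URL="http://localhost:6333"
"""
env = parse_env(sample)
missing = missing_keys(env, ["LLM_API_KEY", "QDRANT_URL"])
print(missing)
```

    To check the real file, pass the contents of .env (for example via `Path(".env").read_text()`) to `parse_env`.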
    

🏃 Execution Guide

The system can be run in two main ways: Local Host Development or Containerized Docker Compose Stack.

1. Running the Streamlit Lecture Assistant

To launch the interactive frontend dashboard:

make app

Access the interface at http://localhost:8501.

2. Running the Label Studio Active Learning Stack

Option A: Running with Docker Compose

This launches both the Label Studio UI and the EduMIND ML Backend in a shared Docker network:

# Start the stack in background
make docker-up

# Check container status
docker compose ps

# View logs
make docker-logs

# Stop the stack
make docker-down

Option B: Running Locally on Host (Directly)

If you want to run Label Studio and the ML Backend natively on your host system:

# Installs Label Studio binaries and starts both servers in one terminal session
make run-ls

🧪 Testing & Code Quality

Running Tests

To run the complete suite of 50+ unit and integration tests:

make test

Checking Style & Formatting

Code formatting is strictly checked using Ruff. Always format your code before pushing changes:

# Auto-format and resolve lint errors
make format

# Dry-run check
make lint