C++20 Modular Programming: Building a Toy Module-Aware Compiler from Scratch

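C++20 introduced modules as an alternative to textual header inclusion (#include), with the aim of faster and more reliable builds. This article uses a deliberately minimal example to show how to build, from scratch, a toy compiler that processes a simplified module format (demonstration only, with very limited functionality). The goal is to help readers understand the module compilation flow and its key steps.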

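1. Basic Concepts of Module Compilation

Module interface unit: the public interface of a module, declared with export module.
Module implementation unit: the internal implementation of a module, declared with module.
Interface/implementation split: the interface and the implementation can be compiled separately, which can reduce rebuild times.
Module dependencies: other modules are brought in with import declarations.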

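2. Architecture of the Toy Compiler

The toy compiler has three main parts:

Preprocessor: recognizes the module, export, and import keywords and builds the corresponding internal structures.
Module resolver: builds the inter-module dependency graph from the preprocessor's output.
Code generator: concatenates the module bodies in dependency order into one output file, which the system compiler turns into an executable.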

3. Preprocessor Implementation

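#include <sstream>
#include <string>
#include <vector>

// One module block from the demo input format.
struct ModuleDef {
    std::string name;
    bool is_export = false;               // true for "export module X;"
    std::vector<std::string> imports;
    std::string body;
};

// Remove a trailing ';' so that "math.add;" is stored as "math.add".
static std::string strip_semicolon(std::string s) {
    if (!s.empty() && s.back() == ';') s.pop_back();
    return s;
}

std::vector<ModuleDef> preprocess(const std::string& src) {
    std::vector<ModuleDef> modules;
    std::istringstream ss(src);
    std::string line;
    ModuleDef cur;
    bool in_module = false;
    while (std::getline(ss, line)) {
        std::istringstream lss(line);
        std::string token;
        lss >> token;
        const bool exported = (token == "export");
        if (exported) lss >> token;       // "export module X;": step to "module"
        if (token == "module") {
            cur.is_export = exported;
            lss >> cur.name;
            cur.name = strip_semicolon(cur.name);
            in_module = true;
        } else if (token == "import") {
            std::string imp;
            lss >> imp;
            cur.imports.push_back(strip_semicolon(imp));
        } else if (token == "end") {      // "end" closes the current module block
            if (in_module) modules.push_back(cur);
            cur = ModuleDef{};
            in_module = false;
        } else if (in_module) {
            cur.body += line + "\n";
        }
    }
    return modules;
}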

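This function shows how to split the source into modules and record each module's imports. The end marker and the several-modules-per-file layout are conventions of this demo format, not C++20 syntax.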
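The per-line classification can also be isolated into a small helper, which makes the two-token "export module" case explicit. The sketch below is only an illustration: DeclKind, Decl, and parse_decl are hypothetical names introduced here, and it assumes each declaration fits on one line and ends with an optional ';'.

```cpp
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper types for illustration; not part of the code above.
enum class DeclKind { ModuleInterface, ModuleImplementation, Import };

struct Decl {
    DeclKind kind;
    std::string name;
};

// Classify one line of the demo format. Returns std::nullopt for lines that
// are not module or import declarations (body lines, "end", blank lines).
std::optional<Decl> parse_decl(const std::string& line) {
    std::istringstream in(line);
    std::string first, name;
    in >> first;

    DeclKind kind;
    if (first == "export") {
        std::string keyword;
        in >> keyword >> name;
        if (keyword != "module") return std::nullopt;  // e.g. "export int f();"
        kind = DeclKind::ModuleInterface;
    } else if (first == "module") {
        in >> name;
        kind = DeclKind::ModuleImplementation;
    } else if (first == "import") {
        in >> name;
        kind = DeclKind::Import;
    } else {
        return std::nullopt;
    }

    // Drop the trailing ';' so names compare equal across declarations.
    if (!name.empty() && name.back() == ';') name.pop_back();
    if (name.empty()) return std::nullopt;
    return Decl{kind, name};
}
```

With this helper, "export module math.add;" is classified as an interface declaration named math.add, while "export int f();" and ordinary body lines are rejected and can be appended to the current module body.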

4. Building the Dependency Graph

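#include <string>
#include <unordered_map>
#include <vector>

// Maps each module name to the names of the modules it imports.
using Graph = std::unordered_map<std::string, std::vector<std::string>>;

Graph build_dependency(const std::vector<ModuleDef>& mods) {
    Graph g;
    for (const auto& m : mods) {
        g[m.name] = m.imports;
    }
    return g;
}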

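A plain hash map is enough to store the dependencies between modules: each key is a module name and each value is the list of modules it imports.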
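Because the map only has entries for modules defined in the input, an import of an unknown module shows up as a name with no key. A minimal sketch of a check for that case follows; find_missing_imports is a hypothetical helper introduced here, and the Graph alias is repeated so the snippet stands alone.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

using Graph = std::unordered_map<std::string, std::vector<std::string>>;

// Hypothetical helper: return every imported name that has no entry in the
// graph, sorted and without duplicates.
std::vector<std::string> find_missing_imports(const Graph& g) {
    std::vector<std::string> missing;
    for (const auto& entry : g) {
        for (const auto& imp : entry.second) {
            if (g.find(imp) == g.end()) missing.push_back(imp);
        }
    }
    std::sort(missing.begin(), missing.end());
    missing.erase(std::unique(missing.begin(), missing.end()), missing.end());
    return missing;
}
```

Running such a check right after building the graph lets the tool report a clear "unknown module" error instead of failing later during ordering or code generation.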

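5. A Simple Topological Sort

To make sure that dependencies are compiled before the modules that use them, the modules need to be topologically sorted: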

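#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Post-order DFS: a module is appended only after all of its imports, so
// `order` already lists dependencies before their users (no reversal needed).
std::vector<std::string> topo_sort(const Graph& g) {
    std::vector<std::string> order;
    std::unordered_set<std::string> visited;
    std::function<void(const std::string&)> dfs = [&](const std::string& u) {
        if (!visited.insert(u).second) return;   // already visited
        auto it = g.find(u);
        if (it != g.end()) {                     // imports not defined in the input have no entry
            for (const auto& v : it->second) dfs(v);
        }
        order.push_back(u);
    };
    for (const auto& [k, _] : g) dfs(k);
    return order;
}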
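As an alternative to the recursive DFS, the same ordering problem can be sketched with Kahn's algorithm, which repeatedly emits modules whose imports have all been emitted. The version below is an illustration under stated assumptions: OrderedGraph and kahn_order are names introduced here, std::map and std::set are used so the output is deterministic, and an imported module with no definition is treated as a leaf.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Module name -> names of the modules it imports (ordered for determinism).
using OrderedGraph = std::map<std::string, std::vector<std::string>>;

// Kahn's algorithm. Returns a dependencies-first order, or std::nullopt if
// the graph contains a cycle.
std::optional<std::vector<std::string>> kahn_order(const OrderedGraph& g) {
    std::map<std::string, std::size_t> pending;             // unmet imports per module
    std::map<std::string, std::vector<std::string>> users;  // reverse edges
    for (const auto& [name, imports] : g) {
        pending[name];                      // ensure every module has an entry
        for (const auto& imp : imports) {
            pending[imp];                   // undefined imports count as leaves
            ++pending[name];
            users[imp].push_back(name);
        }
    }

    std::set<std::string> ready;            // modules with no unmet imports
    for (const auto& [name, count] : pending) {
        if (count == 0) ready.insert(name);
    }

    std::vector<std::string> order;
    while (!ready.empty()) {
        const std::string cur = *ready.begin();
        ready.erase(ready.begin());
        order.push_back(cur);
        for (const auto& user : users[cur]) {
            if (--pending[user] == 0) ready.insert(user);
        }
    }

    // Any module never emitted still waits on an import, so it lies on a cycle
    // or depends on one.
    if (order.size() != pending.size()) return std::nullopt;
    return order;
}
```

For a graph where main imports math.add and util, and util imports math.add, this yields math.add, util, main. A pair of modules that import each other yields std::nullopt, so the caller can report the circular dependency instead of emitting a wrong order.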

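6. Code Generation Example

In this toy compiler, "code generation" only concatenates the module bodies, in dependency order, into a single C++ source file (out.cpp). The actual translation to assembly and machine code is left to the system compiler.

#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

void generate(const std::vector<ModuleDef>& mods, const std::vector<std::string>& order) {
    std::ofstream out("out.cpp");
    for (const auto& name : order) {
        auto it = std::find_if(mods.begin(), mods.end(),
                               [&](const ModuleDef& m) { return m.name == name; });
        if (it != mods.end()) {
            out << "// Module: " << it->name << "\n";
            out << it->body << "\n";
        }
    }
}   // the stream is flushed and closed by its destructor

7. Main Program Flow

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main() {
    std::ifstream in("sample.mod");
    if (!in) {
        std::cerr << "cannot open sample.mod\n";
        return 1;
    }
    std::string src((std::istreambuf_iterator<char>(in)), {});
    auto mods = preprocess(src);
    auto dep_graph = build_dependency(mods);
    auto order = topo_sort(dep_graph);
    generate(mods, order);
    // Invoke the system compiler to build the executable.
    // (g++ -S out.cpp would emit the assembly file out.s instead.)
    return std::system("g++ out.cpp -o out") == 0 ? 0 : 1;
}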

8. Sample Source File (sample.mod)

export module math.add;
int add(int a, int b) { return a + b; }
end

module main;
import math.add;
int main() {
    int r = add(2, 3);
    return r;
}
end

9. Running and Verifying

$ ./compile_demo   # generates out
$ ./out
$ echo $?          # the demo returns its result as the exit status
5

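10. Discussion and Extensions

Error handling: there is currently no check for circular dependencies. Cycle detection can be added before (or as part of) the topological sort.
Incremental compilation: cache a hash of each compiled module and recompile only when it changes.
Fuller syntax analysis: use Clang's LibTooling to parse the full C++ grammar.
Multi-file support: split the single input file into several files and connect them through import.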

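11. Conclusion

This article walked through a minimal compiler for a C++20-style module format, from basic preprocessing to simple code generation. Its functionality is very limited, but the overall flow (discover modules, resolve dependencies, build in dependency order) mirrors what real module build pipelines do and gives a hands-on starting point for further study. Readers are encouraged to extend it toward a more complete and efficient tool.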