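CI pipeline runtime for our frontend monorepo has become a pain point we can no longer ignore. Before every merge, the `next build` command takes anywhere from 5 to 10 minutes, and that window directly limits our deployment frequency and slows the feedback loop for developers. The root problem is that our view of the process is blurry: which package is slowest to build? Is the bottleneck in dependency resolution, transpilation, or asset minification? After tuning `webpack.config.js`, how do we quantify the actual gain? Without data, every optimization is a guess.

The initial idea was simple: we need a lightweight telemetry system dedicated to capturing, storing, and analyzing build data from the CI pipeline. It has to meet several strict requirements. The ingestion endpoint must be asynchronous and fast enough to absorb bursts from upwards of a hundred concurrent CI jobs in the future. The storage layer must be built for analytics, so that aggregate queries over large volumes of time-series data return quickly. Finally, deployment and maintenance must follow the GitOps approach our team has already adopted across the board, with everything managed declaratively.

Technology choices

After some discussion and prototyping, we settled on the following stack: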
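- Ingestion endpoint: Sanic. We chose Sanic over Flask or Django for its async model. Reporting data from a CI pipeline is a typical fire-and-forget scenario: the endpoint has to respond very quickly and release the CI runner's connection. Sanic runs on `uvloop`, which keeps per-request overhead low and suits a simple, I/O-bound API service like this one.
- Storage and analytics: ClickHouse. We do not need a general-purpose transactional database. Our data model is an event stream: write-heavy and read-light, but the reads are complex aggregations (for example, the P95 build time of a single package over the past 30 days). ClickHouse's columnar storage and vectorized query execution are designed for exactly this kind of OLAP workload.
- Data source: Webpack & Next.js. This is what we want to analyze. Next.js builds are driven by Webpack under the hood, so with Webpack profiling enabled the build can emit a detailed `stats.json` report containing all the raw data we need. The CI job in this article triggers this with the `--profile` flag. Note that in stock Next.js that flag enables React production profiling, so writing `stats.json` generally also requires a stats-emitting Webpack plugin configured in `next.config.js`.
- **Deployment and management: CI/CD (GitLab CI) & GitOps (ArgoCD)**. The telemetry system should itself be treated as a production application. All of its Kubernetes manifests live in a Git repository, and ArgoCD syncs them to the cluster automatically. This keeps environments consistent and changes traceable.
- Visualization: Next.js. Building the internal dashboard on the stack we know best lets us reuse existing components and experience.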
Architecture overview

The data flow and the management flow are shown in the diagram below.

```mermaid
graph TD
    subgraph "GitLab CI Pipeline"
        A[next build --profile] --> B{stats.json};
        B --> C[Python Telemetry Script];
        C -- HTTP POST --> D[Sanic Ingestion API];
    end
    subgraph "Kubernetes Cluster (Managed by ArgoCD)"
        D -- "Batch Insert" --> E[(ClickHouse)];
        F[Next.js Dashboard] -- "SQL Query" --> E;
    end
    subgraph "Developer"
        G[Browser] --> F;
    end
    subgraph "GitOps Flow"
        H[Git Repo: K8s Manifests] -- "Sync" --> I[ArgoCD];
        I -- "Deploys/Updates" --> D;
        I -- "Deploys/Updates" --> F;
    end
```
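Step-by-step implementation: from storing the data to GitOps deployment

1. ClickHouse schema design

We start with the foundation: a table that stores the telemetry for each build. In a real project the schema design matters a great deal, because it directly determines query performance.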
```sql
-- file: init-schema.sql
-- This schema is designed for high-volume writes and fast analytical queries.
-- We use a MergeTree engine, which is the cornerstone of ClickHouse.
CREATE DATABASE IF NOT EXISTS build_metrics;

CREATE TABLE IF NOT EXISTS build_metrics.webpack_build_stats (
    -- Timestamp of the event, crucial for time-series analysis.
    -- We use DateTime64(3) for millisecond precision; now64(3) keeps that
    -- precision in the default (plain now() only has second resolution).
    event_timestamp DateTime64(3, 'Asia/Shanghai') DEFAULT now64(3),

    -- Unique identifier for the CI job.
    ci_job_id UInt64,

    -- Git related information for context.
    git_commit_sha String,
    git_branch String,
    git_ref String,

    -- Project/Package identifier within a monorepo.
    project_name String,

    -- Overall build metrics.
    build_duration_ms UInt32,
    total_assets_size UInt64,
    total_chunks Int32,
    total_modules Int32,

    -- A nested data structure to store detailed module information.
    -- This allows us to analyze performance per module without complex joins.
    modules Nested (
        name String,
        size UInt64,
        type String,
        build_time_ms UInt32
    ),

    -- An array to store top N slowest plugins.
    slowest_plugins Array(String),

    -- LowCardinality is a performance optimization for strings with a small number of unique values.
    build_status LowCardinality(String) -- e.g., 'SUCCESS', 'FAILURE'
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_timestamp)
ORDER BY (project_name, event_timestamp)
SETTINGS index_granularity = 8192;
```
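The key parts here are `ENGINE = MergeTree()`, `PARTITION BY`, and `ORDER BY`. `PARTITION BY toYYYYMM(event_timestamp)` partitions the data by month, so queries restricted to a time range can skip whole partitions, and data management becomes easier (for example, dropping old partitions). `ORDER BY (project_name, event_timestamp)` defines the sorting key, which also serves as the primary key when no separate `PRIMARY KEY` is declared. Data is stored on disk in this order, which makes queries that filter on a specific project very efficient.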
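
To make the effect of the partition and sorting keys concrete, here is a minimal Python sketch that builds the kind of P95 query mentioned earlier. The helper name `build_p95_query` and the `%(name)s` placeholder style are assumptions for illustration; the exact parameter syntax depends on the ClickHouse driver in use.

```python
def build_p95_query(project_name: str, days: int = 30):
    """Return (sql, params) for the P95 build duration of one project.

    Hypothetical helper: the project_name filter uses the leading column of
    the sorting key, and the time bound lets ClickHouse skip monthly
    partitions that fall entirely outside the window.
    """
    if isinstance(days, bool) or not isinstance(days, int) or days <= 0:
        raise ValueError("days must be a positive integer")
    sql = (
        "SELECT quantile(0.95)(build_duration_ms) AS p95_ms "
        "FROM build_metrics.webpack_build_stats "
        "WHERE project_name = %(project_name)s "
        # days is validated as a positive int above, so inlining it is safe.
        f"AND event_timestamp >= now() - INTERVAL {days} DAY"
    )
    return sql, {"project_name": project_name}
```

The project name is passed as a bound parameter rather than formatted into the string, so the query text stays the same across projects.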
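2. Sanic ingestion endpoint

The goal of this API service is to be as simple and efficient as possible. It does one thing: accept JSON, validate it, and write it to ClickHouse without blocking the event loop. The version below inserts one row per request; batching is left as a later refinement, as the code comments note.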
```python
# file: app.py
import os
import logging
from typing import List, Optional

from sanic import Sanic, Request, json
from sanic.exceptions import SanicException
from pydantic import BaseModel, Field
# NOTE: `asynch_ch` is the driver import as given in the original post. Check it
# against the async ClickHouse driver you actually install (for example `asynch`),
# since the client API may differ from the calls used below.
from asynch_ch import Client, connect


# --- Pydantic Models for Data Validation ---
# Using Pydantic provides robust validation and improves code clarity.
# A common mistake is to trust incoming data, which leads to data corruption.
class ModuleStat(BaseModel):
    name: str
    size: int
    type: str
    build_time_ms: int = Field(..., alias="buildTimeMs")


class BuildStatsPayload(BaseModel):
    ci_job_id: int = Field(..., alias="ciJobId")
    git_commit_sha: str = Field(..., alias="gitCommitSha")
    git_branch: str = Field(..., alias="gitBranch")
    git_ref: str = Field(..., alias="gitRef")
    project_name: str = Field(..., alias="projectName")
    build_duration_ms: int = Field(..., alias="buildDurationMs")
    total_assets_size: int = Field(..., alias="totalAssetsSize")
    total_chunks: int = Field(..., alias="totalChunks")
    total_modules: int = Field(..., alias="totalModules")
    modules: List[ModuleStat]
    slowest_plugins: List[str] = Field(..., alias="slowestPlugins")
    build_status: str = Field(..., alias="buildStatus")


# --- Sanic Application Setup ---
app = Sanic("BuildMetricsIngestor")
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# --- ClickHouse Connection Management ---
# In a real production environment, connection details should come from environment variables or a secret manager.
CLICKHOUSE_HOST = os.getenv("CLICKHOUSE_HOST", "localhost")
CLICKHOUSE_PORT = int(os.getenv("CLICKHOUSE_PORT", "9000"))
CLICKHOUSE_USER = os.getenv("CLICKHOUSE_USER", "default")
CLICKHOUSE_PASSWORD = os.getenv("CLICKHOUSE_PASSWORD", "")
CLICKHOUSE_DB = os.getenv("CLICKHOUSE_DB", "build_metrics")

# Columns are listed explicitly: event_timestamp is filled by its DEFAULT, and the
# Nested `modules` column is written as one parallel array per sub-column.
INSERT_QUERY = (
    "INSERT INTO build_metrics.webpack_build_stats ("
    "ci_job_id, git_commit_sha, git_branch, git_ref, project_name, "
    "build_duration_ms, total_assets_size, total_chunks, total_modules, "
    "modules.name, modules.size, modules.type, modules.build_time_ms, "
    "slowest_plugins, build_status"
    ") VALUES"
)

# Global client connection
ch_client: Optional[Client] = None


@app.before_server_start
async def setup_db_client(app, loop):
    """Establishes the ClickHouse connection when the server starts."""
    global ch_client
    logging.info(f"Connecting to ClickHouse at {CLICKHOUSE_HOST}:{CLICKHOUSE_PORT}")
    try:
        ch_client = await connect(
            host=CLICKHOUSE_HOST,
            port=CLICKHOUSE_PORT,
            user=CLICKHOUSE_USER,
            password=CLICKHOUSE_PASSWORD,
            database=CLICKHOUSE_DB,
            secure=False,
        )
        await ch_client.execute("SELECT 1")
        logging.info("ClickHouse connection successful.")
    except Exception as e:
        logging.error(f"Failed to connect to ClickHouse: {e}")
        # Fail fast if the database is not available on startup.
        raise SanicException("Database connection failed", status_code=503) from e


@app.before_server_stop
async def close_db_client(app, loop):
    """Gracefully disconnects from ClickHouse."""
    if ch_client and ch_client.is_connected:
        await ch_client.disconnect()
        logging.info("ClickHouse connection closed.")


@app.post("/v1/ingest")
async def ingest_build_stats(request: Request):
    """
    Main ingestion endpoint. Validates and inserts build statistics.
    """
    if not ch_client:
        return json({"error": "Database client not initialized"}, status=503)

    try:
        # Pydantic handles validation. If it fails (including a missing body),
        # a validation error is raised.
        payload = BuildStatsPayload.model_validate(request.json)
    except Exception as e:
        logging.warning(f"Invalid payload received: {e}")
        return json({"error": "Invalid payload format", "details": str(e)}, status=400)

    # In a high-throughput system, it's more efficient to insert data in batches.
    # Here, for simplicity, we insert one record at a time. A more advanced implementation
    # would use a background queue (e.g., asyncio.Queue) to buffer and batch inserts.
    try:
        # One row, in the same order as the column list in INSERT_QUERY.
        row = (
            payload.ci_job_id,
            payload.git_commit_sha,
            payload.git_branch,
            payload.git_ref,
            payload.project_name,
            payload.build_duration_ms,
            payload.total_assets_size,
            payload.total_chunks,
            payload.total_modules,
            [m.name for m in payload.modules],
            [m.size for m in payload.modules],
            [m.type for m in payload.modules],
            [m.build_time_ms for m in payload.modules],
            payload.slowest_plugins,
            payload.build_status,
        )
        await ch_client.execute(INSERT_QUERY, [row])
    except Exception as e:
        logging.error(f"Failed to insert data into ClickHouse: {e}")
        return json({"error": "Internal server error during data insertion"}, status=500)

    # 202 Accepted keeps the contract stable once inserts move to a background batch.
    return json({"status": "ok"}, status=202)


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000, access_log=False)
```
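This Sanic application follows several production-oriented practices: Pydantic validation, graceful setup and teardown of the database connection, detailed logging, and appropriate HTTP status codes.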
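
The comments in the handler mention buffering inserts through an `asyncio.Queue`. A minimal sketch of that idea follows. `BatchBuffer`, `flush_fn`, `max_batch`, and `max_wait` are hypothetical names, and `flush_fn` stands in for whatever coroutine performs the actual ClickHouse insert.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


class BatchBuffer:
    """Collects rows and flushes them in batches (illustrative sketch)."""

    def __init__(self, flush_fn: Callable[[List[Any]], Awaitable[None]],
                 max_batch: int = 100, max_wait: float = 1.0):
        self._queue: asyncio.Queue = asyncio.Queue()
        self._flush_fn = flush_fn
        self._max_batch = max_batch
        self._max_wait = max_wait

    def add(self, row: Any) -> None:
        """Enqueue a row without blocking the request handler."""
        self._queue.put_nowait(row)

    async def run(self) -> None:
        """Wait for a first row, then gather more until the batch is full or max_wait elapses."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]
            deadline = loop.time() + self._max_wait
            while len(batch) < self._max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            await self._flush_fn(batch)
```

In the Sanic app, `run()` could be started as a background task at server start and the handler would call `add(row)` instead of awaiting the insert. A fuller implementation would also flush whatever is still buffered on shutdown and decide what to do when a flush fails.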
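3. CI pipeline collection script

Next we need to generate and send the data from within the CI pipeline. This is typically a Python script run in the job's `after_script` section.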
```yaml
# .gitlab-ci.yml snippet
build:my-next-app:
  stage: build
  image: node:18-alpine
  script:
    - yarn install
    - yarn workspace my-next-app build --profile
    # The build is expected to write stats.json into the project's .next/ directory
    # (stock Next.js needs a stats-emitting webpack plugin for this).
  after_script:
    - apk add --no-cache python3 py3-pip
    - pip install requests pydantic
    - python3 scripts/send_build_metrics.py --stats-file ./apps/my-next-app/.next/stats.json --project-name my-next-app
```
The corresponding `send_build_metrics.py` script parses `stats.json` and pulls context information from the CI environment variables.
```python
# file: scripts/send_build_metrics.py
import os
import json
import argparse
import sys
import logging

import requests

logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')


def parse_webpack_stats(stats_data):
    """
    Extracts key metrics from the complex webpack stats.json object.
    This function is where you can customize what data is most valuable to you.
    """
    # A common pitfall is not handling cases where stats might be missing.
    try:
        # Extracting overall metrics
        duration_ms = stats_data.get('time', 0)
        total_assets_size = sum(asset.get('size', 0) for asset in stats_data.get('assets', []))
        total_chunks = len(stats_data.get('chunks', []))
        total_modules = len(stats_data.get('modules', []))

        # Extracting detailed module data can be performance intensive if the stats file is huge.
        # We cap it at 500 largest modules to keep the payload reasonable.
        modules_data = sorted(
            [
                {
                    "name": m.get('name', 'unknown'),
                    "size": m.get('size', 0),
                    "type": m.get('moduleType', 'unknown'),
                    # Coerce to int so the ingestion API's integer validation accepts it.
                    "buildTimeMs": int(m['profile'].get('building') or 0)
                } for m in stats_data.get('modules', []) if m.get('profile')
            ],
            key=lambda x: x['size'],
            reverse=True
        )[:500]

        # Extracting plugin performance data is crucial for optimization.
        # The 'plugins' key might not exist in all webpack versions.
        plugin_profiles = []
        if 'profile' in stats_data and 'plugins' in stats_data['profile']:
            plugin_profiles = sorted(
                stats_data['profile']['plugins'],
                key=lambda x: x['total'],
                reverse=True
            )
        slowest_plugins = [f"{p['name']}:{p['total']:.2f}ms" for p in plugin_profiles[:10]]

        return {
            "buildDurationMs": duration_ms,
            "totalAssetsSize": total_assets_size,
            "totalChunks": total_chunks,
            "totalModules": total_modules,
            "modules": modules_data,
            "slowestPlugins": slowest_plugins,
        }
    except Exception as e:
        logging.error(f"Failed to parse stats.json: {e}")
        return None


def main():
    parser = argparse.ArgumentParser(description="Send Webpack build stats to a telemetry service.")
    parser.add_argument("--stats-file", required=True, help="Path to the webpack stats.json file.")
    parser.add_argument("--project-name", required=True, help="Name of the project being built.")
    args = parser.parse_args()

    # In a CI environment, these variables are typically pre-defined.
    ingestor_url = os.getenv("METRICS_INGESTOR_URL")
    if not ingestor_url:
        logging.error("METRICS_INGESTOR_URL environment variable is not set. Exiting.")
        sys.exit(1)

    try:
        with open(args.stats_file, 'r') as f:
            stats_data = json.load(f)
    except FileNotFoundError:
        logging.error(f"Stats file not found at {args.stats_file}")
        sys.exit(1)
    except json.JSONDecodeError:
        logging.error(f"Invalid JSON in {args.stats_file}")
        sys.exit(1)

    parsed_metrics = parse_webpack_stats(stats_data)
    if not parsed_metrics:
        sys.exit(1)

    payload = {
        "ciJobId": int(os.getenv("CI_JOB_ID", 0)),
        "gitCommitSha": os.getenv("CI_COMMIT_SHA", "unknown"),
        "gitBranch": os.getenv("CI_COMMIT_BRANCH", "unknown"),
        "gitRef": os.getenv("CI_COMMIT_REF_NAME", "unknown"),
        "projectName": args.project_name,
        "buildStatus": "SUCCESS",  # This could be dynamically set based on CI job status
        **parsed_metrics
    }

    try:
        response = requests.post(ingestor_url, json=payload, timeout=10)
        response.raise_for_status()
        logging.info(f"Successfully sent build metrics for {args.project_name}. Status: {response.status_code}")
    except requests.RequestException as e:
        logging.error(f"Failed to send metrics: {e}")
        # In a robust system, you might want to retry or write to a file for later processing.
        # For this example, we simply exit with an error code.
        sys.exit(1)


if __name__ == "__main__":
    main()
```
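
The comment in the error handler notes that a more robust script might retry. A minimal sketch of retry with exponential backoff is shown below. `post_with_retry`, `send`, `attempts`, `base_delay`, and `sleep` are hypothetical names; `send` is any zero-argument callable that raises on failure, for example a small function wrapping the `requests.post(...)` call and `raise_for_status()`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def post_with_retry(send: Callable[[], T], attempts: int = 3,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call send() up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return send()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting. In a CI job the total retry time should stay small, so that telemetry never noticeably lengthens the pipeline.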
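4. GitOps deployment

We containerize the Sanic application and the Next.js dashboard and write Kubernetes manifests for them. These manifest files live in a separate Git repository.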
```yaml
# file: gitops-repo/apps/build-metrics/ingestor-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: build-metrics-ingestor
  namespace: platform-tools
spec:
  replicas: 2
  selector:
    matchLabels:
      app: build-metrics-ingestor
  template:
    metadata:
      labels:
        app: build-metrics-ingestor
    spec:
      containers:
        - name: ingestor
          image: your-registry/build-metrics-ingestor:v1.2.0 # Image tag is updated via CI
          ports:
            - containerPort: 8000
          env:
            - name: CLICKHOUSE_HOST
              value: "clickhouse-service.data.svc.cluster.local"
            # Other env vars like password should be loaded from a Kubernetes Secret
---
# file: gitops-repo/apps/build-metrics/ingestor-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: build-metrics-ingestor-svc
  namespace: platform-tools
spec:
  selector:
    app: build-metrics-ingestor
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
```
Then we create an `Application` in ArgoCD that points to this Git repository.
```yaml
# file: argocd/app-of-apps/build-metrics-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: build-metrics-platform
  namespace: argocd
spec:
  project: default
  source:
    repoURL: 'https://github.com/your-org/gitops-repo.git'
    targetRevision: HEAD
    path: apps/build-metrics
  destination:
    server: 'https://kubernetes.default.svc'
    namespace: platform-tools
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```
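Once this `Application` manifest is applied to ArgoCD, it automatically pulls the configuration from `gitops-repo` and creates and manages our telemetry services in the `platform-tools` namespace. Any change to the Kubernetes manifests only needs to be committed to the Git repository, and ArgoCD applies it automatically. That is the core appeal of GitOps.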
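
The Deployment manifest carries a comment saying the image tag is updated via CI. One way that step could look is sketched below in Python: a small function that rewrites the tag in the manifest text before the CI job commits it to the GitOps repository. The function name `bump_image_tag` and the regex-based approach are assumptions for illustration; `kustomize edit set image` is a common alternative.

```python
import re


def bump_image_tag(manifest: str, image: str, new_tag: str) -> str:
    """Replace the tag of `image` in a Kubernetes manifest string (sketch)."""
    # Match "image: <image>:<tag>", keeping everything before the tag in group 1.
    pattern = re.compile(r"(image:\s*" + re.escape(image) + r"):[\w.\-]+")
    updated, count = pattern.subn(lambda m: f"{m.group(1)}:{new_tag}", manifest)
    if count == 0:
        # Fail loudly rather than commit an unchanged manifest.
        raise ValueError(f"image {image!r} not found in manifest")
    return updated
```

After the rewritten file is committed and pushed, ArgoCD's automated sync picks up the change like any other manifest edit.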
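Open issues and future iterations

This platform can already answer our original questions: which package builds slowly, and how much an optimization actually helped. It is still far from complete, though. The current approach has the following main limitations:

- Data granularity: we only collect Webpack build data. A complete DORA metrics platform also needs to cover test execution time, deployment frequency, change failure rate, and more. That means integrating additional data sources from the CI/CD system, the code repositories, and the deployment tooling.
- Anomaly detection: the dashboard currently only displays data passively. The next step is to add anomaly detection based on statistical models, so that an alert is raised proactively when a project's build time fluctuates abnormally.
- Cost and maintenance: ClickHouse is powerful, but the storage cost of large data volumes and the upkeep of the cluster also need to be weighed. We need a clear data lifecycle policy (TTL), with old data aggregated and archived to keep costs under control.
- Robustness of the collection script: network problems in the CI environment can cause reports to fail. A sturdier approach would include retry logic, or save the data as a CI artifact on failure so that a separate job can resend it.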
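
As a starting point for the anomaly-detection item, a simple statistical check can be sketched in Python. It flags a build whose duration deviates from the recent median by more than a multiple of the median absolute deviation (MAD). The function name `is_anomalous`, the minimum history of 5 builds, and the threshold of 3.5 are illustrative assumptions, not tuned values.

```python
from statistics import median
from typing import Sequence


def is_anomalous(history_ms: Sequence[float], latest_ms: float,
                 threshold: float = 3.5) -> bool:
    """Flag latest_ms if its robust z-score against history exceeds threshold."""
    if len(history_ms) < 5:
        return False  # Too little history to judge.
    med = median(history_ms)
    mad = median(abs(x - med) for x in history_ms)
    if mad == 0:
        # All recent builds took (almost) the same time: any deviation stands out.
        return latest_ms != med
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    robust_z = 0.6745 * abs(latest_ms - med) / mad
    return robust_z > threshold
```

The median and MAD are used instead of the mean and standard deviation so that a single earlier outlier in the history does not mask a new one. The history itself would come from a ClickHouse query over recent builds of the same project.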