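Our distributed tracing system, Jaeger, has been in production for a while, but the team keeps running into one stubborn diagnostic problem. When a user reports that some frontend operation is slow, Jaeger shows us the complete call chain from the API gateway (Kong) through every backend microservice, yet the most important link is missing: which concrete user action in the frontend triggered this trace? Trace data appears out of nowhere at the gateway, like a river with no source. We know the cause is that context is lost between the browser and the gateway, but the real challenge is that even once the frontend's Trace ID reaches Kong, the information we get is still thin. In a real project we need to know not just "which service was called" but "which version of which service instance, in which environment, was called by which frontend component". That metadata is scattered across our service registry, Consul, and the frontend application logic, and the standard OpenTelemetry plugin does not stitch it together automatically.
The core problem therefore becomes: how do we build a mechanism that, at the API gateway layer, uses the trace context passed from the frontend, looks up service metadata in Consul on the fly, and injects it as attributes into the current span, so that the traces in Jaeger become much richer and support end-to-end, metadata-rich diagnosis from user click to database query? Doing so means going beyond the standard plugin and building our own bridge between the OpenTelemetry context and Consul's service metadata.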
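Step 1: Build a reproducible infrastructure
To tackle this, we first need a local environment that reproduces the scenario end to end. We use Docker Compose to orchestrate all the components: Kong as the API gateway, Consul for service discovery and configuration, Jaeger for collecting and displaying trace data, and echo-service, a simple backend service written in Go.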
# docker-compose.yml
version: '3.8'

services:
  jaeger:
    image: jaegertracing/all-in-one:1.41
    ports:
      - "16686:16686" # Jaeger UI
      - "4317:4317"   # OTLP gRPC receiver
    environment:
      - COLLECTOR_OTLP_ENABLED=true

  consul:
    image: consul:1.13
    ports:
      - "8500:8500" # Consul UI & HTTP API
    command: "agent -server -ui -client=0.0.0.0 -bootstrap-expect=1"

  echo-service:
    build:
      context: ./echo-service
    ports:
      - "8080" # Port is dynamic, exposed to host for debug only
    environment:
      - SERVICE_NAME=echo-service
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
      # This service will register itself to Consul
      - CONSUL_HTTP_ADDR=http://consul:8500
      # We add version metadata during registration
      - SERVICE_VERSION=1.2.3-beta
      - DEPLOY_REGION=us-east-1

  kong:
    image: kong:3.0
    ports:
      - "8000:8000" # Proxy
      - "8001:8001" # Admin API
    environment:
      KONG_DATABASE: 'off'
      KONG_DECLARATIVE_CONFIG: /etc/kong/kong.yml
      # Tell Kong to use Consul for DNS. Consul serves DNS on port 8600
      # (8500 is its HTTP API). Kong documents this setting as a list of
      # ip[:port] entries, so substitute the Consul container's IP if the
      # hostname is not accepted.
      KONG_DNS_RESOLVER: consul:8600
      KONG_PLUGINS: bundled,consul-enricher # Enable our custom plugin
      KONG_LUA_PACKAGE_PATH: /usr/local/share/lua/5.1/?.lua;;
      KONG_LOG_LEVEL: debug
      # Enable Kong's tracing instrumentation (the Kong 3.0 name; later 3.x
      # releases call it KONG_TRACING_INSTRUMENTATIONS). The OTLP endpoint is
      # a setting of the opentelemetry plugin, not an environment variable.
      KONG_OPENTELEMETRY_TRACING: all
    volumes:
      - ./kong/config:/etc/kong/
      - ./kong/plugins/consul-enricher:/usr/local/share/lua/5.1/kong/plugins/consul-enricher
    depends_on:
      - jaeger
      - consul
      - echo-service

networks:
  default:
    name: kong-jaeger-consul-net
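This docker-compose.yml file is the foundation of the whole setup. The key settings are:
- KONG_DNS_RESOLVER: tells Kong to use Consul as its DNS resolver, so that Kong can discover backends dynamically through names such as echo-service.service.consul. Note that Consul serves DNS on port 8600 by default (8500 is its HTTP API) and that Kong expects ip[:port] entries for this setting.
- KONG_PLUGINS: bundled,consul-enricher: declares that a custom plugin named consul-enricher should be loaded.
- KONG_LUA_PACKAGE_PATH: makes sure Kong can find the Lua code of our custom plugin.
- volumes: mounts the custom plugin's source directory into the plugin path inside the Kong container.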
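Step 2: The backend service and service registration
The backend service, echo-service, is written in Go. Besides handling business requests, it takes on two key cloud-native duties: integrating the OpenTelemetry SDK so that it takes part in tracing, and registering itself with Consul at startup with rich metadata attached.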
// echo-service/main.go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/google/uuid"
	"github.com/hashicorp/consul/api"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.12.0"
)

var serviceID string
var tracer = otel.Tracer("echo-service")

func initTracer() (*sdktrace.TracerProvider, error) {
	// The exporter reads its target from OTEL_EXPORTER_OTLP_ENDPOINT.
	exporter, err := otlptracegrpc.New(context.Background(), otlptracegrpc.WithInsecure())
	if err != nil {
		return nil, fmt.Errorf("failed to create OTLP gRPC exporter: %w", err)
	}
	res, err := resource.New(context.Background(),
		resource.WithAttributes(
			semconv.ServiceNameKey.String(os.Getenv("SERVICE_NAME")),
		),
	)
	if err != nil {
		return nil, fmt.Errorf("failed to create resource: %w", err)
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
	)
	otel.SetTracerProvider(tp)
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))
	return tp, nil
}

// registerServiceInConsul registers this instance and returns the client so
// the same client can deregister it on shutdown.
func registerServiceInConsul() *api.Client {
	serviceName := os.Getenv("SERVICE_NAME")
	serviceVersion := os.Getenv("SERVICE_VERSION")
	deployRegion := os.Getenv("DEPLOY_REGION")

	config := api.DefaultConfig()
	config.Address = os.Getenv("CONSUL_HTTP_ADDR")
	consulClient, err := api.NewClient(config)
	if err != nil {
		log.Fatalf("Failed to create consul client: %v", err)
	}

	serviceID = fmt.Sprintf("%s-%s", serviceName, uuid.New().String())
	port := 8080 // In a real app, this would be dynamic
	registration := &api.AgentServiceRegistration{
		ID:      serviceID,
		Name:    serviceName,
		Port:    port,
		Address: os.Getenv("HOSTNAME"), // Use hostname in docker network
		Meta: map[string]string{
			"version": serviceVersion,
			"region":  deployRegion,
		},
		Check: &api.AgentServiceCheck{
			HTTP:                           fmt.Sprintf("http://%s:%d/health", os.Getenv("HOSTNAME"), port),
			Interval:                       "10s",
			Timeout:                        "1s",
			DeregisterCriticalServiceAfter: "1m",
		},
	}
	if err := consulClient.Agent().ServiceRegister(registration); err != nil {
		log.Fatalf("Failed to register service with consul: %v", err)
	}
	log.Printf("Successfully registered service %s with Consul", serviceID)
	return consulClient
}

func deregisterService(client *api.Client) {
	if err := client.Agent().ServiceDeregister(serviceID); err != nil {
		log.Printf("Failed to deregister service: %v", err)
	} else {
		log.Printf("Successfully deregistered service %s", serviceID)
	}
}

func main() {
	tp, err := initTracer()
	if err != nil {
		log.Fatalf("Failed to initialize tracer: %v", err)
	}
	consulClient := registerServiceInConsul()

	// Graceful shutdown setup. os.Exit skips deferred calls, so deregister
	// the service and flush pending spans explicitly before exiting.
	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		<-quit
		log.Println("Shutting down server...")
		deregisterService(consulClient)
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		if err := tp.Shutdown(ctx); err != nil {
			log.Printf("Error shutting down tracer provider: %v", err)
		}
		cancel()
		os.Exit(0)
	}()

	echoHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// otelhttp has already started a server span for this request;
		// start a child span for the handler's own work.
		_, span := tracer.Start(r.Context(), "handle-echo-request")
		defer span.End()
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("Request processed and traced.\n"))
	})
	healthHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("OK"))
	})

	// Wrap handler with OpenTelemetry middleware
	otelHandler := otelhttp.NewHandler(echoHandler, "echo-server")
	http.Handle("/echo", otelHandler)
	http.Handle("/health", healthHandler)

	log.Println("Echo service listening on port 8080")
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatalf("Server failed: %v", err)
	}
}
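The heart of this Go code is the registerServiceInConsul function. It reads the service version and deployment region from environment variables and registers them with Consul in the Meta field. This is exactly the data we later want to see on spans in Jaeger.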
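To confirm that the registration really carries this metadata, it can be read back from Consul. The following is a minimal sketch, assuming Consul's HTTP API is reachable at http://localhost:8500 (the port published in the compose file) and using the /v1/health/service/<name> endpoint. The helper names fetchServiceMeta and parseServiceMeta are illustrative and not part of the original service.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// healthEntry mirrors only the fields of a /v1/health/service entry used here.
type healthEntry struct {
	Service struct {
		ID   string            `json:"ID"`
		Meta map[string]string `json:"Meta"`
	} `json:"Service"`
}

// parseServiceMeta extracts the Meta map of the first instance in the response body.
func parseServiceMeta(body []byte) (map[string]string, error) {
	var entries []healthEntry
	if err := json.Unmarshal(body, &entries); err != nil {
		return nil, fmt.Errorf("decode consul response: %w", err)
	}
	if len(entries) == 0 {
		return nil, fmt.Errorf("no instances returned")
	}
	return entries[0].Service.Meta, nil
}

// fetchServiceMeta asks Consul for the passing instances of a service.
func fetchServiceMeta(consulAddr, service string) (map[string]string, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(consulAddr + "/v1/health/service/" + url.PathEscape(service) + "?passing=true")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("consul returned %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parseServiceMeta(body)
}

func main() {
	// Assumed address: the Consul port published by docker-compose.yml.
	meta, err := fetchServiceMeta("http://localhost:8500", "echo-service")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Printf("version=%s region=%s\n", meta["version"], meta["region"])
}
```

With the compose environment running, this should print the version and region that echo-service registered. Without Consul it reports the failed lookup and exits.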
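Step 3: Generating and propagating the trace context in the frontend
To simulate the frontend, we use a piece of JavaScript that creates a parent span with the OpenTelemetry JS SDK and then issues a request with fetch. In a modern frontend application, code like this is typically bundled by tools such as Babel and Webpack.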
// A simplified example of frontend tracing instrumentation
// In a real project, this would be part of your application bootstrap code.
import { context, trace, SpanStatusCode } from '@opentelemetry/api';
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { SimpleSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { ZoneContextManager } from '@opentelemetry/context-zone';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { FetchInstrumentation } from '@opentelemetry/instrumentation-fetch';
import { W3CTraceContextPropagator } from '@opentelemetry/core';
import { Resource } from '@opentelemetry/resources';

// --- Configuration ---
const provider = new WebTracerProvider({
  resource: new Resource({ 'service.name': 'browser-app' }),
});

// Browser spans only reach Jaeger once an exporter is attached. The collector
// URL below is a placeholder; uncomment and adjust it for your setup.
// const exporter = new OTLPTraceExporter({ url: 'http://jaeger-collector/v1/traces' });
// provider.addSpanProcessor(new SimpleSpanProcessor(exporter));

provider.register({
  contextManager: new ZoneContextManager(),
  propagator: new W3CTraceContextPropagator(),
});

registerInstrumentations({
  instrumentations: [
    new FetchInstrumentation({
      // If the page is served from a different origin than Kong, trace headers
      // are only injected into requests whose URL matches this list.
      propagateTraceHeaderCorsUrls: [/^http:\/\/localhost:8000\//],
    }),
  ],
});

const tracer = provider.getTracer('my-frontend-app');

// --- Business Logic ---
function performAction() {
  // Create a parent span for the user action
  const parentSpan = tracer.startSpan('user-clicks-submit-button');

  // Set attributes relevant to the frontend context
  parentSpan.setAttribute('user.id', 'user-12345');
  parentSpan.setAttribute('component.name', 'CheckoutForm');

  // Make the API call with the parent span set as the active span
  context.with(trace.setSpan(context.active(), parentSpan), async () => {
    try {
      // The FetchInstrumentation will automatically create a child span
      // and inject W3C Trace Context headers (traceparent, tracestate).
      const response = await fetch('http://localhost:8000/api/echo', {
        method: 'GET',
        headers: { 'Content-Type': 'application/json' },
      });
      console.log('API call successful:', await response.text());
    } catch (error) {
      console.error('API call failed:', error);
      parentSpan.recordException(error);
      parentSpan.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    } finally {
      parentSpan.end(); // End the parent span when the action is complete
    }
  });
}

// Simulate a user action
document.getElementById('action-button').addEventListener('click', performAction);
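The key pieces here are W3CTraceContextPropagator and FetchInstrumentation. When fetch is called, they automatically inject a traceparent header into the HTTP request, in a form like 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01. The header carries the Trace ID and the Parent Span ID and is the standard way to propagate context across services. For cross-origin requests, FetchInstrumentation injects it only for URLs matched by its propagateTraceHeaderCorsUrls option.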
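To make the header layout concrete, here is a small Go sketch that splits a version-00 traceparent value into its four fields (version, trace ID, parent span ID, flags) and applies the basic validity checks from the W3C Trace Context format. The type and function names are illustrative; in the actual services the OpenTelemetry propagators do this work.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// TraceParent holds the four fields of a version-00 traceparent header.
type TraceParent struct {
	Version  string
	TraceID  string
	ParentID string
	Sampled  bool
}

// isLowerHex reports whether s consists only of lowercase hex digits.
func isLowerHex(s string) bool {
	for _, c := range s {
		if !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// ParseTraceParent splits and validates a version-00 traceparent value.
func ParseTraceParent(header string) (TraceParent, error) {
	parts := strings.Split(strings.TrimSpace(header), "-")
	if len(parts) != 4 {
		return TraceParent{}, errors.New("traceparent: expected 4 dash-separated fields")
	}
	version, traceID, parentID, flags := parts[0], parts[1], parts[2], parts[3]
	if version != "00" {
		return TraceParent{}, fmt.Errorf("traceparent: unsupported version %q", version)
	}
	if len(traceID) != 32 || len(parentID) != 16 || len(flags) != 2 {
		return TraceParent{}, errors.New("traceparent: unexpected field length")
	}
	if !isLowerHex(traceID) || !isLowerHex(parentID) || !isLowerHex(flags) {
		return TraceParent{}, errors.New("traceparent: fields must be lowercase hex")
	}
	if traceID == strings.Repeat("0", 32) || parentID == strings.Repeat("0", 16) {
		return TraceParent{}, errors.New("traceparent: all-zero IDs are invalid")
	}
	flagBits, err := strconv.ParseUint(flags, 16, 8)
	if err != nil {
		return TraceParent{}, err
	}
	return TraceParent{
		Version:  version,
		TraceID:  traceID,
		ParentID: parentID,
		Sampled:  flagBits&0x01 == 1, // lowest bit is the "sampled" flag
	}, nil
}

func main() {
	tp, err := ParseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		fmt.Println("invalid header:", err)
		return
	}
	// prints: trace=4bf92f3577b34da6a3ce929d0e0e4736 parent=00f067aa0ba902b7 sampled=true
	fmt.Printf("trace=%s parent=%s sampled=%t\n", tp.TraceID, tp.ParentID, tp.Sampled)
}
```

Every service that receives the header keeps the trace ID and records the parent ID as the parent of its own span, which is what lets Jaeger assemble spans from the browser, Kong, and echo-service into one trace.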
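Step 4: Implementing the custom Kong plugin
Now for the core part: writing a Lua plugin that, during Kong's request processing, fetches metadata from Consul and attaches it to the OpenTelemetry span.
First, define the plugin's schema, which determines what configuration options the plugin has.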
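-- kong/plugins/consul-enricher/schema.lua
return {
  name = "consul-enricher",
  fields = {
    -- Kong plugin schemas carry their options in a `config` record,
    -- even when the plugin has no options yet.
    { config = {
        type = "record",
        fields = {
          -- We can add configuration fields if needed, for now it's empty.
        },
    } },
  },
}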
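Next comes the plugin's main logic, handler.lua. Of the phases in a Kong plugin's lifecycle, access fits our scenario best: the request is about to be proxied to the upstream service, and Kong's OpenTelemetry support has already created the server span for the request.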
-- kong/plugins/consul-enricher/handler.lua
local kong = kong

-- A small utility for pretty printing tables for debugging
local function pretty_print(t)
  local s = "{"
  for k, v in pairs(t) do
    s = s .. string.format(" [%s] = %s,", tostring(k), tostring(v))
  end
  return s .. " }"
end

local ConsulEnricherHandler = {
  -- Kong runs plugins with a higher PRIORITY first. The bundled opentelemetry
  -- plugin uses 14, so a lower value makes this plugin run after it.
  PRIORITY = 10,
  VERSION = "0.1.0",
}

function ConsulEnricherHandler:access(conf)
  -- 1. Get the active span through the tracing PDK. It is only available
  --    when Kong's tracing instrumentation is enabled.
  local span = kong.tracing and kong.tracing.active_span()
  if not span then
    kong.log.debug("consul-enricher: no active span found, skipping.")
    return
  end

  -- 2. Get the upstream service object that Kong has resolved for this request.
  local service = kong.router.get_service()
  if not service then
    kong.log.debug("consul-enricher: no service associated with the route, skipping.")
    return
  end

  -- 3. We are interested in services that are dynamically resolved by Consul.
  --    Their host will be in the format `servicename.service.consul`.
  --    We extract the service name from the host (dots escaped, pattern anchored).
  local service_name = service.host:match("^([^.]+)%.service%.consul$")
  if not service_name then
    kong.log.debug("consul-enricher: service host '", service.host, "' is not a Consul service, skipping.")
    return
  end
  kong.log.debug("consul-enricher: found Consul service: ", service_name)

  -- 4. Use Kong's built-in DNS client to query Consul for service metadata.
  --    NOTE: kong.dns is Kong's internal DNS client, not a documented PDK
  --    module. The query() call and the record fields used below are kept
  --    from the original design and must be verified against the Kong
  --    version in use; the guard makes the plugin degrade instead of failing.
  if not (kong.dns and kong.dns.query) then
    kong.log.warn("consul-enricher: kong.dns.query is unavailable, skipping.")
    return
  end
  -- The query returns a list of records for the service. We only need the first one.
  local records, err = kong.dns.query(service.host)
  if err or not records or #records == 0 then
    kong.log.err("consul-enricher: failed to resolve service from Consul: ", err)
    return
  end
  kong.log.debug("consul-enricher: first record: ", pretty_print(records[1]))

  -- 5. The plugin expects the metadata in the TXT records of the Consul DNS response.
  --    Example TXT record: "consul-service-meta-version=1.2.3-beta"
  local txt_records, txt_err = kong.dns.query(service_name .. ".service.consul", { qtype = kong.dns.TYPE_TXT })
  if txt_err or not txt_records then
    kong.log.warn("consul-enricher: could not fetch TXT records for metadata: ", txt_err)
  else
    for _, record in ipairs(txt_records) do
      local key, value
      if type(record.txt) == "string" then
        key, value = record.txt:match("consul%-service%-meta%-([^=]+)=(.+)")
      end
      if key and value then
        local attribute_key = "consul.service.meta." .. key
        kong.log.debug("consul-enricher: setting span attribute: ", attribute_key, " = ", value)
        -- This is the magic line: add metadata to the current span.
        span:set_attribute(attribute_key, value)
      end
    end
  end

  -- Also add the resolved IP and Port as attributes for debugging.
  local target_ip = records[1].ip
  local target_port = records[1].port
  if target_ip then
    span:set_attribute("net.peer.ip", target_ip)
  end
  if target_port then
    span:set_attribute("net.peer.port", target_port)
  end
  kong.log.debug("consul-enricher: enriched span for service ", service_name, " targeting ", target_ip, ":", target_port)
end

return ConsulEnricherHandler

How the code works:
- kong.tracing.active_span(): the tracing PDK call that returns the span currently active for the request, so the plugin can add attributes to it.
- kong.router.get_service(): returns the Kong Service object matched by the current request.
- service.host:match(...): we decide whether a service is managed by Consul by checking that its host field ends in .service.consul.
- kong.dns.query(...): the centerpiece of the plugin. The idea is to query Consul through Kong's built-in, non-blocking DNS client rather than writing an HTTP client against the Consul API, because that reuses Kong's DNS caching. Keep in mind that kong.dns is an internal object rather than part of the documented PDK, so this call needs checking against the Kong version you run.
- Parsing TXT records: the plugin expects Consul's DNS interface to encode the service's Meta in TXT records and extracts the key/value pairs with a Lua pattern. Consul's HTTP API (/v1/health/service/<name>) returns the same data in Service.Meta, which is a useful cross-check if the TXT records come back empty.
- span:set_attribute(...): sets the metadata fetched from Consul, such as version and region, on the span under keys like consul.service.meta.version.
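The two pure transformations in the plugin (recognising a Consul-managed host and mapping Meta keys to span attribute keys) are easy to get subtly wrong, so it can help to pin them down outside Kong. The sketch below restates them in Go purely as an illustration; the function names are hypothetical and the plugin itself remains Lua.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

const (
	consulSuffix = ".service.consul"
	attrPrefix   = "consul.service.meta."
)

// consulServiceName returns the service name for hosts of the form
// "<name>.service.consul" and false for anything else.
func consulServiceName(host string) (string, bool) {
	if !strings.HasSuffix(host, consulSuffix) {
		return "", false
	}
	name := strings.TrimSuffix(host, consulSuffix)
	if name == "" || strings.Contains(name, ".") {
		return "", false
	}
	return name, true
}

// spanAttributes prefixes every Meta key so it is recognisable on the span.
func spanAttributes(meta map[string]string) map[string]string {
	attrs := make(map[string]string, len(meta))
	for k, v := range meta {
		attrs[attrPrefix+k] = v
	}
	return attrs
}

func main() {
	name, ok := consulServiceName("echo-service.service.consul")
	fmt.Println(name, ok)

	attrs := spanAttributes(map[string]string{"version": "1.2.3-beta", "region": "us-east-1"})
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random; sort for stable output
	for _, k := range keys {
		fmt.Printf("%s = %s\n", k, attrs[k])
	}
}
```

Note that a host such as echo-serviceXserviceXconsul is rejected here, which is also why the dots in the Lua pattern need to be escaped.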
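Step 5: Integration and verification
The last step is to put all the configuration together and verify the result. Here is Kong's declarative configuration file, kong.yml: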
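# kong/config/kong.yml
_format_version: "3.0"
_transform: true

services:
  - name: echo-service-proxy
    # Use Consul's service discovery name
    host: echo-service.service.consul
    port: 8080
    protocol: http
    # The route strips /api/echo, so forward to the path the Go handler serves.
    path: /echo
    plugins:
      - name: opentelemetry
        config:
          # Kong's plugin exports OTLP over HTTP; Jaeger accepts it on port 4318.
          endpoint: http://jaeger:4318/v1/traces
          # Trace every request. If your Kong version rejects this field,
          # set the sampling rate in kong.conf instead.
          sampling_rate: 1
      - name: consul-enricher # Enable our custom plugin on this service

routes:
  - name: echo-route
    paths:
      - /api/echo
    strip_path: true
    methods:
      - GET
    service: echo-service-proxy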
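Now start the whole environment with docker-compose up. Once all services are running, open a simple HTML page in the browser and run the JavaScript snippet above to trigger the fetch request.
Then open the Jaeger UI (http://localhost:16686) and search for the browser-app service. You will see a complete trace: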
graph TD
    A[browser-app: user-clicks-submit-button] --> B[browser-app: GET]
    B --> C[kong: GET /api/echo]
    C --> D[echo-service: echo-server]
    D --> E[echo-service: handle-echo-request]
This alone is already standard distributed tracing. The real value shows when you click the kong: GET /api/echo span. In its "Tags" (or "Attributes") tab you will see the information injected by our custom plugin:
- consul.service.meta.version: 1.2.3-beta
- consul.service.meta.region: us-east-1
- net.peer.ip: 172.x.x.x (the IP of the echo-service container)
- net.peer.port: 8080
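When something goes wrong in production, operators no longer have to guess which version of the service is at fault. They can see directly in the Jaeger UI that the slow request was routed to a service instance running version 1.2.3-beta in the us-east-1 region. Combined with user.id and component.name on the frontend span, this gives us complete, detailed debugging information from the user-interface interaction down to the specific backend service instance.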
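Limitations and outlook
The approach solves the problem effectively, but it comes with trade-offs. For every request that passes through Kong, the plugin issues one or more DNS queries against Consul. Kong's DNS module does cache, but under high concurrency this can still become a performance bottleneck or a point of failure. In a real project you should load-test the plugin, consider adding an in-memory cache for the Consul lookup results (for example with lua-resty-lrucache), and define a graceful degradation strategy for when Consul is unavailable (for example, skip the enrichment step and log a single warning).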
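The implementation also depends heavily on Kong's internal kong.dns API, so future Kong upgrades carry a compatibility risk. A more forward-looking direction is to explore Kong's newer Go or WebAssembly (WASM) plugin systems. Implementing the logic in Go would be more robust and easier to test, and it could use Consul's official Go SDK. WASM offers cross-language support, so the team could write high-performance, sandboxed plugins in languages such as Rust or TinyGo, which may well be where API gateway extensibility ends up.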