Notes: thoughts and actions prompted by Sublime showing a search result as binary (WIP)

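Over yesterday and today I mostly figured out a problem that had puzzled me for a long time. I learned a lot along the way, so I am recording it here.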

The original problem

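  • Once, while searching a codebase for a keyword, Sublime reported a match but showed no preview and labeled the file "binary".
  • At the time I traced the cause to a BS (backspace) character, one of the ASCII control characters, inside that file. The editor did not render it like a normal character, so it was visibly special, but back then I knew very little about Unicode/ASCII and character encodings and did not understand what it meant. I only found that deleting the character made the search result normal again.
  • Why did this kind of character turn the file's search result into "binary"?
  • What are these special characters?
  • How can I locate them quickly?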
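
For the last question, here is a minimal sketch of one way to locate such characters. It is not the CharDetector mentioned in this post (its code is not shown here); the names CONTROL_CHARS and find_control_chars are made up for illustration. It assumes the text is valid UTF-8 and treats tab, LF and CR as ordinary characters.

```ruby
# Hypothetical helper, not the author's CharDetector.
# Matches ASCII control characters except tab (0x09), LF (0x0A) and CR (0x0D).
CONTROL_CHARS = /[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/

# Returns one hash per control character found, with its 1-based
# line number, 1-based column and code point.
def find_control_chars(text)
  hits = []
  text.each_line.with_index(1) do |line, lineno|
    line.each_char.with_index(1) do |char, column|
      next unless char.match?(CONTROL_CHARS)

      hits << { line: lineno, column: column, codepoint: format("U+%04X", char.ord) }
    end
  end
  hits
end

# Usage: a backspace hidden in the first line.
p find_control_chars("ab\bcd\nxyz\n")
```

To scan a file, pass File.read(path) as the text.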

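The learning process

Takeaways and reflections from implementing this CharDetector with TDD

References I found valuable

PS. Letting a post grow longer and longer is not a good habit!

Answering the original questions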

WIP

analyse-git-commit-count

INTRO

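I recently put together a small tool that analyzes everyone's commits in a git repository, to see how many commits each person has made there.

The statistics cannot be perfectly accurate, but they are good enough to satisfy curiosity.

The core: git shortlog --summary --numbered --email --all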

SETUP

require 'csv'
require 'tempfile'

# Write an array of hashes to a CSV file, using the first hash's keys as the header row.
def write_output_as_csv(output, output_csv_file_path)
  return if output.empty? # nothing to write, e.g. a repo without commits

  headers = output.first.keys

  CSV.open(output_csv_file_path, "w") do |csv|
    csv << headers

    output.each do |hash|
      csv << hash.values
    end
  end
end

# One line per author: commit count, name, <email>.
def datasource
  `git shortlog --summary --numbered --email --all`.split("\n")
end

def output
  records = []

  datasource.each do |line|
    result = line.match(/(\d+)(.*)(<.*>)/)
    next unless result # skip lines that do not look like "count name <email>"

    record = {}
    record[:email] = result[3].strip
    record[:count] = result[1].strip

    records << record
  end

  deduplicate(records)
end

# Merge records that share an email by summing their counts, then sort by count, descending.
def deduplicate(arr)
  arr.group_by { |r| r[:email] }.each_with_object([]) do |(email, records), mem|
    mem << { email: email, count: records.map { |record| record[:count].to_i }.sum }
  end.sort_by { |record| record[:count] }.reverse
end

def run
  Tempfile.create(["gitinfo.", ".csv"]) do |file|
    filepath = file.path

    write_output_as_csv(output, filepath)

    # Print the CSV rows without the header line.
    File.readlines(filepath).drop(1).each { |line| puts line }
  end
end

run

HAVE FUN

# cd to any git repo
ruby $ABS_PATH/analyse-git-info.rb | uplot bar -o -d, -t "Git commit count of user"

TODO

  • Turn this tool into a gem
  • Write tests with rspec (for practice…)
  • How can it be made to work when it receives its input on stdin as well?
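
For the stdin item, one possible direction is sketched below, under assumptions. The names SHORTLOG_LINE, shortlog_lines and commit_counts are hypothetical and not part of the script above. The idea is to read piped input when stdin is not a terminal and fall back to running git otherwise, while keeping the parsing in a function that only sees lines.

```ruby
# Hypothetical sketch: "count name <email>" lines, as printed by git shortlog.
SHORTLOG_LINE = /\A\s*(\d+)\s+(.*?)\s*(<.*>)\s*\z/

# Use piped input when there is any; otherwise run git ourselves.
def shortlog_lines(input = $stdin)
  return input.each_line.map(&:chomp) unless input.tty?

  `git shortlog --summary --numbered --email --all`.split("\n")
end

# Sum commit counts per email and sort by count, descending.
def commit_counts(lines)
  totals = Hash.new(0)
  lines.each do |line|
    match = SHORTLOG_LINE.match(line)
    next unless match # ignore lines that do not look like shortlog output

    totals[match[3]] += match[1].to_i
  end
  totals.sort_by { |_email, count| -count }
        .map { |email, count| { email: email, count: count } }
end
```

One caveat of this design: wherever stdin is not a terminal (cron, some CI runners), the sketch would wait for piped input instead of calling git, so an explicit command-line flag may be the safer choice.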

RestClient Issue

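Notes on a pitfall I ran into while using the RestClient gem.

Version: rest-client (2.1.0)

TLDR

If you call the convenience wrappers RestClient.get/RestClient.put/RestClient.post directly, an error response (for example 400 Bad Request) surfaces only as an exception whose message carries no useful information.

Call them with a block instead, and take response, request and result as block parameters; without a block, the details of the error response are lost from view.

Example

Use Rails to set up a simple endpoint that responds with 400 and returns an error message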

# config/routes.rb
post 'demo', to: "demo#test"

# app/controllers/demo_controller.rb
class DemoController < ApplicationController
  def test
    render :json => {msg: 'this msg explains why this is a bad request'}, :status => :bad_request
  end
end

# Enter the rails console

# case 1: without a block (the call raises, so there is no return value)
[23] pry(main)> RestClient.post("localhost:3000/demo", {})
# RestClient::BadRequest: 400 Bad Request
# from /Users/lijunwei/.rvm/gems/ruby-2.6.3@api-provider/gems/rest-client-2.1.0/lib/restclient/abstract_response.rb:249:in `exception_with_response'

# case 2: with a block
[24] pry(main)> RestClient.post("localhost:3000/demo", {}) {|response, request, result| puts "response.body: #{response.body}\nrequest.body: #{request.args}\nresult.body: #{result.body}"} # => nil
# response.body: {"msg":"this msg explains why this is a bad request"}
# request.body: {:method=>:post, :url=>"localhost:3000/demo", :payload=>{}, :headers=>{}}
# result.body: {"msg":"this msg explains why this is a bad request"}

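As you can see, the block version gets the detailed information returned in the response.

I noticed this difference while calling the JIRA7 and WeCom (企业微信) APIs. The symptom: calling the API with RestClient gave back only a 400, while debugging with Postman showed the error message, and so did debugging with net/http.

Only after debugging and reading the documentation did I realize that RestClient had swallowed the information.

The source shows that for 4XX and 5XX status codes, RestClient does not parse and return any information contained in the response; it only raises the corresponding exception, constructed from the response via klass.new(self, code).

Judging from the comment, this is a feature, not a bug. But if someone really does return information in a 4XX response (as the JIRA7 and WeCom robot APIs do), you have to be careful when using RestClient…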

# lib/restclient/abstract_response.rb

# Return the default behavior corresponding to the response code:
#
# For 20x status codes: return the response itself
#
# For 30x status codes:
#   301, 302, 307: redirect GET / HEAD if there is a Location header
#   303: redirect, changing method to GET, if there is a Location header
#
# For all other responses, raise a response exception
#
def return!(&block)
  case code
  when 200..207
    self
  when 301, 302, 307
    case request.method
    when 'get', 'head'
      check_max_redirects
      follow_redirection(&block)
    else
      raise exception_with_response
    end
  when 303
    check_max_redirects
    follow_get_redirection(&block)
  else
    raise exception_with_response
  end
end

def exception_with_response
  begin
    klass = Exceptions::EXCEPTIONS_MAP.fetch(code)
  rescue KeyError
    raise RequestFailed.new(self, code)
  end

  raise klass.new(self, code)
end

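Conclusion: if you use RestClient, always use a block; if you use another lib, check whether it has a similar problem.

Option 1: wrap your own requests with RestClient::Request.execute(:method => :get, :url => url, :headers => headers, &block)

Option 2: use the convenience methods such as RestClient.get, and pass a block

Example: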

require 'json'
require 'securerandom'
require 'rest-client'

def make_qywxrobot_request(url, payload)
  RestClient.post(url, payload.to_json, content_type: :json) do |response, request, result|
    response_body = JSON.parse(response.body)
    response_code = response.code
    request_id = SecureRandom.hex # random id to correlate log entries
    errcode = response_body["errcode"]

    # Everything needed to investigate a failure later.
    hash = {}
    hash[:request_id] = request_id
    hash[:request_args] = request.args
    hash[:response_code] = response_code
    hash[:response_body] = response_body

    if response_code == 200 && errcode == 0
      return response_body
    elsif response_code == 200 && known_qywx_errcode?(errcode)
      # exception handler 1: a known error code, log it and carry on
      Rails.logger.error("#{__method__} #{hash.to_json}")
    else
      # exception handler 2: anything else, log it and fail loudly
      Rails.logger.error("#{__method__} #{hash.to_json}")
      raise "Failed to send the alert message through the WeCom (企业微信) API"
    end
  end
end

# https://developer.work.weixin.qq.com/document/path/95390
def known_qywx_errcode?(errcode)
  errcode == 45009 ||
    errcode == 45033
end

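Option 3: wrap Ruby's built-in Net::HTTP yourself

So far I have not seen what advantage RestClient offers… (perhaps I just have not run into very complex HTTP request scenarios)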
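
For option 3, a minimal sketch of what such a wrapper might look like. The name post_json and the shape of its return value are assumptions for illustration. Net::HTTP hands back the response object for 4XX and 5XX statuses instead of raising, so the body stays available to the caller.

```ruby
require 'net/http'
require 'json'
require 'uri'

# Hypothetical helper: POST a JSON payload and return the status code together
# with the parsed body, whether the status is 2XX, 4XX or 5XX.
def post_json(url, payload)
  uri = URI.parse(url)
  response = Net::HTTP.post(uri, payload.to_json, 'Content-Type' => 'application/json')

  body = begin
    JSON.parse(response.body.to_s)
  rescue JSON::ParserError
    response.body # not JSON: hand back the raw text
  end

  { code: response.code.to_i, body: body }
end
```

Against the demo endpoint above, post_json("http://localhost:3000/demo", {}) would be expected to return the 400 code together with the msg from the response body. Network failures (timeouts, refused connections) still raise and need their own handling.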

Reflections

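  1. After rereading the documentation carefully, I found that it does offer some clues; I just did not understand them when I hit the problem.

  2. This "swallowed information" problem resembles error handling: a swallowed exception is often very annoying, so think it through before deciding whether to rescue.

  • A few takeaways here:
    • Write begin/rescue as rarely as possible, and rescue every Exception even more rarely. Readability improves, and so does maintainability, because the code crashes when something goes wrong and a crash can be traced to its source. If rescue is everywhere, troubleshooting becomes very laborious.
    • Be sure to understand the difference between "rescue Exception => e" and "rescue => e"; in most cases the former must be avoided.
    • Where fault tolerance is required, the code must not crash. But if something does fail there, it must raise an alert promptly and record the state at the time of failure for later investigation and repair. Simply swallowing the error is never acceptable.
    • The more specific the rescued exceptions are, or the less exception-handling code there is, the more carefully the code was thought through (provided the exception really can occur). The code ends up much cleaner, and such code is pleasant to use, read and maintain.
  3. When it is not necessary, consider not using a lib at all (for example, when writing a gem, pull in as few dependencies as possible).

  4. When using an open-source lib, get to know it first rather than using it blindly. Otherwise strange problems become a real headache, and a security issue can be very troublesome or even cause huge losses.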
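
The difference between the two rescue forms mentioned above can be seen in a small sketch. The helper name rescued_by_bare_rescue? is made up for illustration. A bare rescue only catches StandardError and its subclasses, while rescue Exception also catches things like Interrupt (Ctrl-C) and NotImplementedError, which usually should be allowed to propagate.

```ruby
# Hypothetical demo helper: does a bare `rescue` catch this error class?
def rescued_by_bare_rescue?(error_class)
  begin
    raise error_class, "boom"
  rescue # same as `rescue StandardError`
    return true
  end
rescue Exception # only here so the demo itself does not crash
  false
end

puts rescued_by_bare_rescue?(ZeroDivisionError)   # a StandardError subclass
puts rescued_by_bare_rescue?(NotImplementedError) # a ScriptError subclass
puts rescued_by_bare_rescue?(Interrupt)           # a SignalException subclass
```

Only the first call is caught by the bare rescue; the other two pass straight through it.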

migrate_to_frozen_string_literal

I’ve read a blog post written by Mike Perham introducing the Magic Comment, and I tried it out in my project.

The # frozen_string_literal: true

STEP-1: add this “magic comment”

# Find all Ruby files.
# Iterate through them.
# Add two lines (the magic comment and a blank line) if the file doesn't already start with them.

basedir = Rails.root.to_s
filenames = Dir["#{basedir}/**/*.rb"]
target = "# frozen_string_literal: true"

filenames.each do |filename|
  lines = File.foreach(filename).first(2)

  if lines != ["#{target}\n", "\n"]
    # gsed is GNU sed on macOS; on Linux, use plain sed instead.
    `gsed -i "1i#{target}" "#{filename}"` # insert the magic comment as line 1
    `gsed -i "1G" "#{filename}"`          # append a blank line after line 1
  end
end
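
If GNU sed is not available, the same step could be done in plain Ruby. This is a sketch under assumptions: the names MAGIC_COMMENT, with_magic_comment and add_magic_comment! are hypothetical, and files that start with a shebang line or another magic comment are not handled.

```ruby
MAGIC_COMMENT = "# frozen_string_literal: true"

# Return the source with the magic comment and a blank line prepended,
# or unchanged when it already starts with the magic comment.
def with_magic_comment(source)
  return source if source.start_with?(MAGIC_COMMENT)

  "#{MAGIC_COMMENT}\n\n#{source}"
end

# Rewrite the file in place, but only when something actually changes.
def add_magic_comment!(path)
  original = File.read(path)
  updated = with_magic_comment(original)
  File.write(path, updated) unless updated == original
end
```

Keeping the string transformation separate from the file I/O makes it easy to check on a few sample sources before touching the whole project.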

STEP-2: do automated/manual tests

This is important, since your project may contain code that mutates string literals.

STEP-3: deploy and pay extra attention to production state

Be ready to rollback your deployment. You know, shit happens.

Exception occurred: FrozenError

Yes, it happened…

{
  "class": "ScanAlertingAndPendingWorker",
  "args": [],
  "retry": 1,
  "queue": "default",
  "jid": "0a48de6e85b383432760b013",
  "created_at": 1646124005.491867,
  "enqueued_at": 1646124034.8366895,
  "error_message": "can't modify frozen String",
  "error_class": "FrozenError",
  "failed_at": 1646124035.9765105,
  "retry_count": 0
}

Occurrence No.1

# frozen_string_literal: true

content = ""
content << 'world'
content << 'hello'

solution

# frozen_string_literal: true

content = String.new
content << 'world'
content << 'hello'

Occurrence No.2

# frozen_string_literal: true

body = 'Roses are red Mud is fun'
body.force_encoding('utf-8')

solution

# frozen_string_literal: true

body = 'Roses are red Mud is fun'
body.dup.force_encoding('utf-8')
# or
String.new(body).force_encoding('utf-8')
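
Both occurrences come down to mutating a frozen literal. The sketch below collects the usual ways to get a mutable string. It uses an explicit .freeze as a stand-in for what the magic comment does to every literal in a file, and try_append is a hypothetical helper for the demo.

```ruby
# Hypothetical demo helper: try to mutate a string and report what happened.
def try_append(str)
  str << "!"
  :ok
rescue FrozenError
  :frozen_error
end

# Explicit .freeze stands in for a literal under `# frozen_string_literal: true`.
frozen = "Roses are red".freeze

p try_append(frozen)      # the frozen string itself cannot be mutated
p try_append(frozen.dup)  # dup returns an unfrozen copy
p try_append(+frozen)     # unary plus returns a mutable string (a copy when frozen)
p try_append(String.new)  # a fresh, empty, mutable string
```

Only the first call raises FrozenError; the other three mutate a copy or a new string and leave the original untouched.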

At Last

But to my disappointment, I didn’t see a significant memory reduction.

It might be related to the size of the system. (Is it?)

Tradeoff Collection

I have observed a lot of trade-offs.
I would call making trade-offs a kind of art.
This is my collection of them.

Space | Time Tradeoff

A space–time trade-off (or time–memory trade-off) in computer science is a case where an algorithm or program trades increased space usage for decreased time.

Here, space refers to the data storage consumed in performing a given task (RAM, HDD, etc), and time refers to the time consumed in performing a given task (computation time or response time).
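
Memoization is a small, familiar instance of this trade-off. The sketch below (the function names are chosen for illustration) spends memory on a cache to avoid recomputing the same Fibonacci numbers over and over.

```ruby
# Little extra space, lots of time: recomputes the same subproblems repeatedly.
def fib_slow(n)
  n < 2 ? n : fib_slow(n - 1) + fib_slow(n - 2)
end

# More space, far less time: each value is computed once and kept in the cache.
def fib_memo(n, cache = {})
  return n if n < 2

  cache[n] ||= fib_memo(n - 1, cache) + fib_memo(n - 2, cache)
end

puts fib_slow(20)
puts fib_memo(20)
```

Both calls return the same number; the memoized version just gets there with far fewer recursive calls, at the cost of storing one cache entry per n.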

Code Changeability/Simplicity | Code Readability Tradeoff

hint from “99-bottles-of-oop”

  • abstract -> easy to change (hard to understand)
  • concrete -> easy to read (hard to change)

Security | Efficiency Tradeoff

cryptography

  • As a rule of thumb, the stronger the encryption (for example, longer keys or more rounds), the more secure it is, but the more time it takes to encrypt/decrypt

even good (or trendy) solutions add complexity