Over the past few years, the adoption of RocksDB has increased dramatically, and it has become a standard for embeddable key-value stores.
Today RocksDB runs in production at Meta, Microsoft, Netflix, and Uber. At Meta, RocksDB serves as a storage engine for the MySQL deployment powering the distributed graph database.
Big tech companies are not the only RocksDB users. Several startups have been built around RocksDB: CockroachDB, Yugabyte, PingCAP, and Rockset.
I spent the past 4 years at Datadog building and running services on top of RocksDB in production. In this post, I'll give a high-level overview of how RocksDB works.
RocksDB is an embeddable persistent key-value store. It's a type of database designed to store large numbers of unique keys associated with values. The simple key-value data model can be used to build search indexes, document-oriented databases, SQL databases, caching systems and message brokers.
RocksDB was forked off Google's LevelDB in 2012 and optimized to run on servers with SSD drives. Currently, RocksDB is developed and maintained by Meta.
RocksDB is written in C++. In addition to C and C++, the C bindings allow embedding the library into applications written in other languages such as Rust, Go or Java.
If you ever used SQLite, then you already know what an embeddable database is. In the context of databases, and particularly in the context of RocksDB, "embeddable" means:
The database doesn't have a standalone process; instead, it's integrated directly into your application as a library, sharing its resources and memory, removing the need for expensive inter-process communication.
It doesn't come with a built-in server that can be accessed over the network.
It is not distributed, meaning it does not provide fault tolerance, replication, or sharding mechanisms.
It is up to the application to implement these features if necessary.
RocksDB stores data as a collection of key-value pairs. Keys and values are not typed; they are just arbitrary byte arrays. The database provides a low-level interface with a few functions for modifying the state of the collection:
put(key, value): stores a new key-value pair or updates an existing one
merge(key, value): combines the new value with the existing value for a given key
delete(key): removes a key-value pair from the collection
Values can be retrieved with point lookups:
get(key)
An iterator enables "range scans" - seeking to a specific key and accessing subsequent key-value pairs in order:
iterator.seek(key_prefix); iterator.value(); iterator.next()
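To make these operations concrete, here is a toy in-memory model of the interface in Go. This is not RocksDB or its bindings, just a hypothetical sketch of the semantics: untyped byte-string keys and values, point lookups, and prefix scans in key order (the merge operation is omitted here and covered later).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// toyDB is a toy in-memory model of the RocksDB interface, not RocksDB
// itself: keys and values are byte strings, iteration is ordered by key.
type toyDB struct {
	data map[string]string
}

func (db *toyDB) put(key, value string) { db.data[key] = value }
func (db *toyDB) delete(key string)     { delete(db.data, key) }

func (db *toyDB) get(key string) (string, bool) {
	v, ok := db.data[key]
	return v, ok
}

// scan returns all key-value pairs whose key starts with prefix, in key
// order - the range-scan behavior the iterator interface provides.
func (db *toyDB) scan(prefix string) []string {
	var keys []string
	for k := range db.data {
		if strings.HasPrefix(k, prefix) {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	var r []string
	for _, k := range keys {
		r = append(r, k+"="+db.data[k])
	}
	return r
}

func main() {
	db := &toyDB{data: map[string]string{}}
	db.put("cat", "1")
	db.put("chipmunk", "2")
	db.put("dog", "3")
	v, _ := db.get("cat")
	fmt.Println(v)            // 1
	fmt.Println(db.scan("c")) // [cat=1 chipmunk=2]
}
```

A real LSM-tree implements the same interface very differently, as the rest of the post shows.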
The core data structure behind RocksDB is called the Log-structured merge-tree (LSM-tree). It's a tree-like structure organized into multiple levels, with data on each level ordered by key. The LSM-tree was primarily designed for write-heavy workloads and was introduced in 1996 in a paper of the same name.
The top level of the LSM-Tree is kept in memory and contains the most recently inserted data. The lower levels are stored on disk and are numbered from 0 to N. Level 0 (L0) stores data moved from memory to disk, Level 1 and below store older data. When a level becomes too large, it's merged with the next level, which is typically an order of magnitude larger than the previous one.
Note
I'll be talking specifically about RocksDB, but most of the concepts covered apply to many databases that use LSM-trees under the hood (e.g. Bigtable, HBase, Cassandra, ScyllaDB, LevelDB, MongoDB WiredTiger).
To better understand how LSM-trees work, let's take a closer look at the write and read paths.
The top level of the LSM-tree is known as the MemTable. It's an in-memory buffer that holds keys and values before they are written to disk. All inserts and updates always go through the memtable. This is also true for deletes - rather than modifying key-value pairs in-place, RocksDB marks deleted keys by inserting a tombstone record.
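The tombstone idea can be sketched in a few lines of Go. This is a simplified model, assuming a plain map-based memtable; the point is only that a delete inserts a record rather than removing one:

```go
package main

import "fmt"

// entry models a memtable record; tombstone marks a deleted key.
type entry struct {
	value     string
	tombstone bool
}

type memtable map[string]entry

func (m memtable) put(key, value string) { m[key] = entry{value: value} }

// del does not remove the key; it inserts a tombstone that shadows any
// older value for the same key stored in lower levels.
func (m memtable) del(key string) { m[key] = entry{tombstone: true} }

func (m memtable) get(key string) (string, bool) {
	e, ok := m[key]
	if !ok || e.tombstone {
		return "", false
	}
	return e.value, true
}

func main() {
	m := memtable{}
	m.put("chipmunk", "1")
	m.del("chipmunk")
	_, found := m.get("chipmunk")
	fmt.Println(found)                   // false: the tombstone hides the key
	fmt.Println(m["chipmunk"].tombstone) // true: the record still exists
}
```

The space taken by tombstones is reclaimed later, during compaction.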
The memtable is configured to have a specific size in bytes. When the memtable becomes full, it is swapped with a new memtable, and the old memtable becomes immutable.
Note
The default size of the memtable is 64MB.
Let's start by adding a few keys to the database:
db.put("chipmunk", "1")
db.put("cat", "2")
db.put("raccoon", "3")
db.put("dog", "4")
As you can see, the key-value pairs in the memtable are ordered by the key. Although chipmunk was inserted first, it comes after cat in the memtable due to the sorted order. The ordering is a requirement for supporting range scans, and it makes some operations, which I will cover later, more efficient.
In the event of a process crash or a planned application restart, data stored in the process memory is lost. To prevent data loss and ensure that database updates are durable, RocksDB writes all updates to the Write-ahead log (WAL) on disk, in addition to the memtable. This way the database can replay the log and restore the original state of the memtable on startup.
The WAL is an append-only file, consisting of a sequence of records. Each record contains a key-value pair, a record type (Put/Merge/Delete), and a checksum. The checksum is used to detect data corruption or partially written records when replaying the log. Unlike in the memtable, records in the WAL are not ordered by key. Instead, they are appended in the order in which they arrive.
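The record layout can be sketched as follows. The exact on-disk format of RocksDB's WAL is different; this hypothetical encoding only illustrates how a checksum detects corruption when replaying the log:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// encodeRecord builds a WAL-style record: record type, key length, key,
// value length, value, followed by a CRC32 checksum of the payload.
func encodeRecord(recType byte, key, value []byte) []byte {
	var buf []byte
	buf = append(buf, recType)
	buf = binary.AppendUvarint(buf, uint64(len(key)))
	buf = append(buf, key...)
	buf = binary.AppendUvarint(buf, uint64(len(value)))
	buf = append(buf, value...)
	sum := crc32.ChecksumIEEE(buf)
	return binary.LittleEndian.AppendUint32(buf, sum)
}

// verifyRecord recomputes the checksum to detect corrupted or partially
// written records on replay.
func verifyRecord(rec []byte) bool {
	if len(rec) < 4 {
		return false
	}
	payload := rec[:len(rec)-4]
	sum := binary.LittleEndian.Uint32(rec[len(rec)-4:])
	return crc32.ChecksumIEEE(payload) == sum
}

func main() {
	rec := encodeRecord(1, []byte("cat"), []byte("2"))
	fmt.Println(verifyRecord(rec)) // true
	rec[3] ^= 0xFF                 // simulate a corrupted byte
	fmt.Println(verifyRecord(rec)) // false
}
```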
RocksDB runs a dedicated background thread that persists immutable memtables to disk. As soon as the flush is complete, the immutable memtable and the corresponding WAL are discarded. RocksDB starts writing to a new WAL and a new memtable. Each flush produces a single SST file on L0. The produced files are immutable - they are never modified once written to disk.
The default memtable implementation in RocksDB is based on a skip list. The data structure is a linked list with additional layers of links that allow fast search and insertion in sorted order. The ordering makes the flush efficient, allowing the memtable content to be written to disk sequentially by iterating the key-value pairs. Turning random inserts into sequential writes is one of the key ideas behind the LSM-tree design.
Note
RocksDB is highly configurable. Like many other components, the memtable implementation can be swapped with an alternative. It's not uncommon to see self-balancing binary search trees used to implement memtables in other LSM-based databases.
SST files contain the key-value pairs flushed from the memtable to disk in a format optimized for queries. SST stands for Static Sorted Table (or Sorted String Table in some other databases). It's a block-based file format that organizes data into blocks (the default size target is 4KB). Individual blocks can be compressed with the various compression algorithms supported by RocksDB, such as Zlib, BZ2, Snappy, LZ4, or ZSTD. Similar to records in the WAL, blocks contain checksums to detect data corruption. RocksDB verifies these checksums every time it reads from disk.
Blocks in an SST file are divided into sections. The first section, the data section, contains an ordered sequence of key-value pairs. This ordering allows delta-encoding of keys, meaning that instead of storing full keys, we can store only the difference between adjacent keys.
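A sketch of the delta-encoding idea: each key stores only how many leading bytes it shares with the previous key, plus the remaining suffix. (Real SST blocks are more elaborate; for example, they include restart points so that a reader doesn't have to decode from the start of the block.)

```go
package main

import "fmt"

// sharedPrefixLen returns the length of the common prefix of a and b.
func sharedPrefixLen(a, b string) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// deltaEncode turns a sorted key list into (shared, suffix) pairs.
// Because adjacent sorted keys often share long prefixes, this can
// significantly shrink the data section.
func deltaEncode(keys []string) []string {
	var out []string
	prev := ""
	for _, k := range keys {
		n := sharedPrefixLen(prev, k)
		out = append(out, fmt.Sprintf("(%d,%q)", n, k[n:]))
		prev = k
	}
	return out
}

func main() {
	fmt.Println(deltaEncode([]string{"cat", "caterpillar", "chipmunk"}))
	// [(0,"cat") (3,"erpillar") (1,"hipmunk")]
}
```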
While the key-value pairs in an SST file are stored in sorted order, binary search cannot always be applied, particularly when the blocks are compressed, making searching the file inefficient. RocksDB optimizes lookups by adding an index, which is stored in a separate section right after the data section. The index maps the last key in each data block to its corresponding offset on disk. Again, the keys in the index are ordered, allowing us to find a key quickly by performing a binary search. For example, if we are searching for lynx, the index tells us the key might be in block 2 because lynx comes after chipmunk but before raccoon.
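The index lookup boils down to a binary search over the last keys of the data blocks. A minimal sketch (the block keys are assumptions matching the example above; indices are zero-based, so index 1 is the second block, "block 2" in the text):

```go
package main

import (
	"fmt"
	"sort"
)

// findBlock returns the index of the first block whose last key is >= key,
// i.e. the only block that can possibly contain the key. lastKeys holds
// the last key of each data block, in order.
func findBlock(lastKeys []string, key string) int {
	return sort.SearchStrings(lastKeys, key)
}

func main() {
	// Last keys of three hypothetical data blocks.
	lastKeys := []string{"chipmunk", "raccoon", "zebra"}
	fmt.Println(findBlock(lastKeys, "lynx")) // 1: after chipmunk, before raccoon
	fmt.Println(findBlock(lastKeys, "cat"))  // 0
}
```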
In reality, there is no lynx in the SST file above, but we still had to read the block from disk and search it. RocksDB supports enabling a bloom filter: a space-efficient probabilistic data structure used to test whether an element belongs to a set. It's stored in an optional bloom filter section and makes searching for keys that don't exist faster.
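Here is a minimal, illustrative bloom filter sketch. It derives k hash functions from a single FNV hash by mixing in a seed; RocksDB's real implementation uses a different, more carefully tuned hashing scheme and bit layout:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a toy Bloom filter: m bits and k hash functions.
type bloom struct {
	bits []bool
	k    int
}

func newBloom(m, k int) *bloom { return &bloom{bits: make([]bool, m), k: k} }

func (b *bloom) hash(s string, seed int) int {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d:%s", seed, s)
	return int(h.Sum64() % uint64(len(b.bits)))
}

func (b *bloom) add(s string) {
	for i := 0; i < b.k; i++ {
		b.bits[b.hash(s, i)] = true
	}
}

// mayContain never returns false for a key that was added; it may rarely
// return true for a key that was never added (a false positive).
func (b *bloom) mayContain(s string) bool {
	for i := 0; i < b.k; i++ {
		if !b.bits[b.hash(s, i)] {
			return false
		}
	}
	return true
}

func main() {
	f := newBloom(1024, 3)
	f.add("chipmunk")
	f.add("raccoon")
	fmt.Println(f.mayContain("chipmunk")) // true
	fmt.Println(f.mayContain("lynx"))     // almost certainly false: skip the disk read
}
```

When the filter says a key is absent, RocksDB can skip reading the block from disk entirely.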
Additionally, there are several other less interesting sections, like the metadata section.
What I described so far is already a functional key-value store. But there are a few challenges that would prevent using it in a production system: space and read amplification. Space amplification measures the ratio of storage space to the size of the logical data stored. For example, if a database needs 2MB of disk space to store key-value pairs that take 1MB, the space amplification is 2. Similarly, read amplification measures the number of IO operations required to perform a logical read operation. I'll let you figure out what write amplification is as a little exercise.
Now, let's add more keys to the database and remove a few:
db.delete("chipmunk")
db.put("cat", "5")
db.put("raccoon", "6")
db.put("zebra", "7")
// Flush triggers
db.delete("raccoon")
db.put("cat", "8")
db.put("zebra", "9")
db.put("duck", "10")
As we keep writing, the memtables get flushed and the number of SST files on L0 keeps growing:
The space taken by deleted or updated keys is never reclaimed. For example, the cat key has three copies; chipmunk and raccoon still take up space on disk even though they're no longer needed.
Reads get slower as their cost grows with the number of SST files on L0. Each key lookup requires inspecting every SST file to find the needed key.
A mechanism called compaction helps to reduce space and read amplification in exchange for increased write amplification. Compaction selects SST files on one level and merges them with SST files on a level below, discarding deleted and overwritten keys. Compactions run in the background on a dedicated thread pool, which allows RocksDB to continue processing read and write requests while compactions are taking place.
Leveled Compaction is the default compaction strategy in RocksDB. With Leveled Compaction, key ranges of SST files on L0 overlap. Levels 1 and below are organized to contain a single sorted key range partitioned into multiple SST files, ensuring that there is no overlap in key ranges within a level. Compaction picks files on a level and merges them with the overlapping range of files on the level below. For example, during an L0 to L1 compaction, if the input files on L0 span the entire key range, the compaction has to pick all files from L0 and all files from L1.
For this L1 to L2 compaction below, the input file on L1 overlaps with two files on L2, so the compaction is limited only to a subset of files.
Compaction is triggered when the number of SST files on L0 reaches a certain threshold (4 by default). For L1 and below, compaction is triggered when the size of the entire level exceeds the configured target size. When this happens, it may trigger an L1 to L2 compaction. This way, an L0 to L1 compaction may cascade all the way to the bottommost level. After the compaction ends, RocksDB updates its metadata and removes compacted files from disk.
Note
RocksDB provides other compaction strategies offering different tradeoffs between space, read and write amplification.
Remember that keys in SST files are ordered? The ordering guarantee allows merging multiple SST files incrementally with the help of the k-way merge algorithm. K-way merge is a generalized version of the two-way merge, working similarly to the merge phase of merge sort.
With immutable SST files persisted on disk, the read path is less sophisticated than the write path. A key lookup traverses the LSM-tree from the top to the bottom. It starts with the active memtable, descends to L0, and continues to lower levels until it finds the key or runs out of SST files to check.
Here are the lookup steps:
Search the active memtable.
Search immutable memtables.
Search all SST files on L0 starting from the most recently flushed.
For L1 and below, find a single SST file that may contain the key and search the file.
Searching an SST file involves:
(optional) Probe the bloom filter.
Search the index to find the block the key may belong to.
Read the block and try to find the key there.
That's it!
Consider this LSM-tree:
Depending on the key, a lookup may end early at any step. For example, looking up cat or chipmunk ends after searching the active memtable. Searching for raccoon, which exists only on Level 1, or manul, which doesn't exist in the LSM-tree at all, requires searching the entire tree.
RocksDB provides another feature that touches both read and write paths: the Merge operation. Imagine you store a list of integers in a database. Occasionally you need to extend the list. To modify the list, you read the existing value from the database, update it in memory and then write back the updated value. This is called the "Read-Modify-Write" loop:
db = open_db(path)
// Read
old_val = db.get(key) // RocksDB stores keys and values as byte arrays. We need to deserialize the value to turn it into a list.
old_list = deserialize_list(old_val) // old_list: [1, 2, 3]
// Modify
new_list = old_list.extend([4, 5, 6]) // new_list: [1, 2, 3, 4, 5, 6]
new_val = serialize_list(new_list)
// Write
db.put(key, new_val)
db.get(key) // deserialized value: [1, 2, 3, 4, 5, 6]
The approach works, but has some flaws:
It's not thread-safe - two different threads may try to update the same key, overwriting each other's updates.
Write amplification - the cost of the update increases as the value gets larger. E.g., appending a single integer to a list of 100 requires reading 100 and writing back 101 integers.
In addition to the Put and Delete write operations, RocksDB supports a third write operation, Merge, which aims to solve these problems. The Merge operation requires providing a Merge Operator - a user-defined function responsible for combining incremental updates into a single value:
func merge_operator(existing_val, updates) {
combined_list = deserialize_list(existing_val)
for op in updates {
combined_list.extend(op)
}
return serialize_list(combined_list)
}
db = open_db(path, {merge_operator: merge_operator})
// key's value is [1, 2, 3]
list_update = serialize_list([4, 5, 6])
db.merge(key, list_update)
db.get(key) // deserialized value: [1, 2, 3, 4, 5, 6]
The merge operator above combines the incremental updates passed to Merge calls into a single value. When Merge is called, RocksDB inserts only the incremental update into the memtable and the WAL. Later, during flush and compaction, RocksDB calls the merge operator function to combine the updates into a single larger update or a single value whenever possible. On a Get call or an iteration, if there are any pending (not yet compacted) updates, the same function is called to return a single combined value to the caller.
Merge is a good fit for write-heavy streaming applications that constantly need to make small updates to the existing values. So, where is the catch? Reads become more expensive - the work done on reads is not saved. Repetitive queries fetching the same keys have to do the same work over and over again until a flush and compaction are triggered. Like almost everything else in RocksDB, the behavior can be tuned by limiting the number of merge operands in the memtable or by reducing the number of SST files in L0.
If performance is critical for your application, the most challenging aspect of using RocksDB is configuring it appropriately for a specific workload. RocksDB offers numerous configuration options, and tuning them often requires understanding the database internals and diving deep into the RocksDB source code:
"Unfortunately, configuring RocksDB optimally is not trivial. Even we as RocksDB developers don't fully understand the effect of each configuration change. If you want to fully optimize RocksDB for your workload, we recommend experiments and benchmarking, while keeping an eye on the three amplification factors."
Writing a production-grade key-value store from scratch is hard:
Hardware and OS can betray you at any moment, dropping or corrupting data.
Performance optimizations require a large time investment.
RocksDB solves these problems, allowing you to focus on the business logic instead. That makes RocksDB an excellent building block for databases.
Full-Text Search (FTS) is a technique for searching text in a collection of documents. A document can refer to a web page, a newspaper article, an email message, or any structured text.
Today we are going to build our own FTS engine. By the end of this post, we'll be able to search across millions of documents in less than a millisecond. We'll start with simple search queries like "give me all documents that contain the word cat" and we'll extend the engine to support more sophisticated boolean queries.
Note
The most well-known FTS engine is Lucene (as well as Elasticsearch and Solr, which are built on top of it).
Before we start writing code, you may ask "can't we just use grep or have a loop that checks if every document contains the word I'm looking for?". Yes, we can. However, it's not always the best idea.
We are going to search a part of the abstracts of English Wikipedia. The latest dump is available at dumps.wikimedia.org. As of today, the file size after decompression is 913 MB. The XML file contains over 600K documents.
Document example:
<title>Wikipedia: Kit-Cat Klock</title>
<url>https://en.wikipedia.org/wiki/Kit-Cat_Klock</url>
<abstract>The Kit-Cat Klock is an art deco novelty wall clock shaped like a grinning cat with cartoon eyes that swivel in time with its pendulum tail.</abstract>
First, we need to load all the documents from the dump. The built-in encoding/xml package comes in very handy:
import (
"encoding/xml"
"os"
)
type document struct {
Title string `xml:"title"`
URL string `xml:"url"`
Text string `xml:"abstract"`
ID int
}
func loadDocuments(path string) ([]document, error) {
f, err := os.Open(path)
if err != nil {
return nil, err
}
defer f.Close()
dec := xml.NewDecoder(f)
dump := struct {
Documents []document `xml:"doc"`
}{}
if err := dec.Decode(&dump); err != nil {
return nil, err
}
docs := dump.Documents
for i := range docs {
docs[i].ID = i
}
return docs, nil
}
Every loaded document gets assigned a unique identifier. To keep things simple, the first loaded document gets assigned ID=0, the second ID=1 and so on.
Now that we have all documents loaded into memory, we can try to find the ones about cats. First, let's loop through all documents and check if they contain the substring cat:
func search(docs []document, term string) []document {
var r []document
for _, doc := range docs {
if strings.Contains(doc.Text, term) {
r = append(r, doc)
}
}
return r
}
On my laptop, the search phase takes 103ms - not too bad. If you spot check a few documents from the output, you may notice that the function matches caterpillar and category, but doesn't match Cat with the capital C. That's not quite what I was looking for.
We need to fix two things before moving forward:
Make the search case-insensitive (so Cat matches as well).
Match on a word boundary rather than on a substring (so caterpillar and communication don't match).
One solution that quickly comes to mind and allows implementing both requirements is regular expressions.
Here it is - (?i)\bcat\b:
(?i) makes the regex case-insensitive
\b matches a word boundary (position where one side is a word character and another side is not a word character)
func search(docs []document, term string) []document {
re := regexp.MustCompile(`(?i)\b` + term + `\b`) // Don't do this in production, it's a security risk. term needs to be sanitized.
var r []document
for _, doc := range docs {
if re.MatchString(doc.Text) {
r = append(r, doc)
}
}
return r
}
Ugh, the search took more than 2 seconds. Things start getting slow even with 600K documents. While the approach is easy to implement, it doesn't scale well. As the dataset grows, we need to scan more and more documents. The time complexity of this algorithm is linear: we have to scan every document in the collection. If we had 6M documents instead of 600K, the search would take 20 seconds. We need to do better than that.
To make search queries faster, we'll preprocess the text and build an index in advance.
The core of FTS is a data structure called the inverted index. The inverted index associates every word in the documents with the documents that contain that word.
Example:
documents = {
1: "a donut on a glass plate",
2: "only the donut",
3: "listen to the drum machine",
}
index = {
"a": [1],
"donut": [1, 2],
"on": [1],
"glass": [1],
"plate": [1],
"only": [2],
"the": [2, 3],
"listen": [3],
"to": [3],
"drum": [3],
"machine": [3],
}
Below is a real-world example of an inverted index: a book index, where a term references a page number:
Before we start building the index, we need to break the raw text down into a list of words (tokens) suitable for indexing and searching.
The text analyzer consists of a tokenizer and multiple filters.
The tokenizer is the first step of text analysis. Its job is to convert text into a list of tokens. Our implementation splits the text on a word boundary and removes punctuation marks:
func tokenize(text string) []string {
return strings.FieldsFunc(text, func(r rune) bool {
// Split on any character that is not a letter or a number.
return !unicode.IsLetter(r) && !unicode.IsNumber(r)
})
}
> tokenize("A donut on a glass plate. Only the donuts.")
["A", "donut", "on", "a", "glass", "plate", "Only", "the", "donuts"]
In most cases, just converting text into a list of tokens is not enough. To make the text easier to index and search, we'll need to do additional normalization.
In order to make the search case-insensitive, the lowercase filter converts tokens to lower case. cAt, Cat and caT are normalized to cat. Later, when we query the index, we'll lower case the search terms as well. This will make the search term cAt match the text Cat.
func lowercaseFilter(tokens []string) []string {
r := make([]string, len(tokens))
for i, token := range tokens {
r[i] = strings.ToLower(token)
}
return r
}
> lowercaseFilter([]string{"A", "donut", "on", "a", "glass", "plate", "Only", "the", "donuts"})
["a", "donut", "on", "a", "glass", "plate", "only", "the", "donuts"]
Almost any English text contains commonly used words like a, I, the or be. Such words are called stop words. We are going to remove them since almost any document would match the stop words.
There is no "official" list of stop words. Let's exclude the top 10 by the OEC rank. Feel free to add more:
var stopwords = map[string]struct{}{ // I wish Go had built-in sets.
"a": {}, "and": {}, "be": {}, "have": {}, "i": {},
"in": {}, "of": {}, "that": {}, "the": {}, "to": {},
}
func stopwordFilter(tokens []string) []string {
r := make([]string, 0, len(tokens))
for _, token := range tokens {
if _, ok := stopwords[token]; !ok {
r = append(r, token)
}
}
return r
}
> stopwordFilter([]string{"a", "donut", "on", "a", "glass", "plate", "only", "the", "donuts"})
["donut", "on", "glass", "plate", "only", "donuts"]
Because of grammar rules, documents may include different forms of the same word. Stemming reduces words to their base form. For example, fishing, fished and fisher may be reduced to the base form (stem) fish.
Implementing a stemmer is a non-trivial task and is not covered in this post. We'll use one of the existing modules:
import snowballeng "github.com/kljensen/snowball/english"
func stemmerFilter(tokens []string) []string {
r := make([]string, len(tokens))
for i, token := range tokens {
r[i] = snowballeng.Stem(token, false)
}
return r
}
> stemmerFilter([]string{"donut", "on", "glass", "plate", "only", "donuts"})
["donut", "on", "glass", "plate", "only", "donut"]
Note
A stem is not always a valid word. For example, some stemmers may reduce airline to airlin.
Finally, let's chain the tokenizer and the filters together into an analyze function:
func analyze(text string) []string {
tokens := tokenize(text)
tokens = lowercaseFilter(tokens)
tokens = stopwordFilter(tokens)
tokens = stemmerFilter(tokens)
return tokens
}
The tokenizer and filters convert sentences into a list of tokens:
> analyze("A donut on a glass plate. Only the donuts.")
["donut", "on", "glass", "plate", "only", "donut"]
The tokens are ready for indexing.
Back to the inverted index. It maps every word in documents to document IDs. The built-in map is a good candidate for storing the mapping. The key in the map is a token (string) and the value is a list of document IDs:
type index map[string][]int
Building the index consists of analyzing the documents and adding their IDs to the map:
func (idx index) add(docs []document) {
for _, doc := range docs {
for _, token := range analyze(doc.Text) {
ids := idx[token]
if ids != nil && ids[len(ids)-1] == doc.ID {
// Don't add same ID twice.
continue
}
idx[token] = append(ids, doc.ID)
}
}
}
func main() {
idx := make(index)
idx.add([]document{{ID: 1, Text: "A donut on a glass plate. Only the donuts."}})
idx.add([]document{{ID: 2, Text: "donut is a donut"}})
fmt.Println(idx)
}
It works! Each token in the map refers to IDs of the documents that contain the token:
map[donut:[1 2] glass:[1] is:[2] on:[1] only:[1] plate:[1]]
To query the index, we are going to apply the same tokenizer and filters we used for indexing:
func (idx index) search(text string) [][]int {
var r [][]int
for _, token := range analyze(text) {
if ids, ok := idx[token]; ok {
r = append(r, ids)
}
}
return r
}
> idx.search("Small wild cat")
[[24, 173, 303, ...], [98, 173, 765, ...], [24, 51, 173, ...]]
And finally, we can find all documents that mention cats. Searching 600K documents took less than a millisecond (18µs)!
With the inverted index, the time complexity of the search query is linear to the number of search tokens. In the example query above, other than analyzing the input text, search had to perform only three map lookups.
The query from the previous section returned a separate list of documents for each token. But what we normally expect when we type small wild cat in a search box is a list of results that contain small, wild and cat at the same time. The next step is to compute the set intersection between the lists. This way we'll get a list of documents matching all tokens.
Luckily, IDs in our inverted index are inserted in ascending order. Since the IDs are sorted, it's possible to compute the intersection between two lists in linear time. The intersection function iterates over both lists simultaneously and collects the IDs that exist in both:
func intersection(a []int, b []int) []int {
// The intersection can't be larger than the smaller of the two lists.
n := len(a)
if len(b) < n {
n = len(b)
}
r := make([]int, 0, n)
var i, j int
for i < len(a) && j < len(b) {
if a[i] < b[j] {
i++
} else if a[i] > b[j] {
j++
} else {
r = append(r, a[i])
i++
j++
}
}
return r
}
The updated search analyzes the given query text, looks up the tokens and computes the set intersection between the lists of IDs:
func (idx index) search(text string) []int {
var r []int
for _, token := range analyze(text) {
if ids, ok := idx[token]; ok {
if r == nil {
r = ids
} else {
r = intersection(r, ids)
}
} else {
// Token doesn't exist.
return nil
}
}
return r
}
The Wikipedia dump contains only two documents that match small, wild and cat at the same time:
> idx.search("Small wild cat")
130764 The wildcat is a species complex comprising two small wild cat species, the European wildcat (Felis silvestris) and the African wildcat (F. lybica).
131692 Catopuma is a genus containing two Asian small wild cat species, the Asian golden cat (C. temminckii) and the bay cat.
The search is working as expected!
By the way, this is the first time I've heard of a catopuma; here is one of them:
We just built a Full-Text Search engine. Despite its simplicity, it can be a solid foundation for more advanced projects.
I didn't touch on a lot of things that can significantly improve the performance and make the engine more user friendly. Here are some ideas for further improvements:
Extend boolean queries to support OR and NOT.
Store the index on disk:
Rebuilding the index on every application restart may take a while.
Large indexes may not fit in memory.
Experiment with memory and CPU-efficient data formats for storing sets of document IDs. Take a look at Roaring Bitmaps.
Support indexing multiple document fields.
Sort results by relevance.
The full source code is available on GitHub.
The built-in string type is represented internally as a structure containing two fields: Data is a pointer to the string data, and Len is the length of the string:
type StringHeader struct {
Data uintptr
Len int
}
In Go, strings are immutable, so multiple strings can share the same underlying data:
package main
import (
"fmt"
"reflect"
"unsafe"
)
// stringptr returns a pointer to the string data.
func stringptr(s string) uintptr {
return (*reflect.StringHeader)(unsafe.Pointer(&s)).Data
}
func main() {
s1 := "1234"
s2 := s1[:2] // "12"
// s1 and s2 are different strings, but they point to the same string data
fmt.Println(stringptr(s1) == stringptr(s2)) // true
}
Most modern programming languages, including Go, intern compile-time string constants:
s1 := "12"
s2 := "1"+"2"
fmt.Println(stringptr(s1) == stringptr(s2)) // true
But strings generated at runtime are not interned:
s1 := "12"
s2 := strconv.Itoa(12)
fmt.Println(stringptr(s1) == stringptr(s2)) // false
To implement string interning, we need a data structure representing a pool of strings. The pool needs to support two operations: adding a string to the pool and retrieving a string from the pool. A good candidate for these requirements is a hash map:
package main
import (
"fmt"
"strconv"
)
type stringInterner map[string]string
func (si stringInterner) Intern(s string) string {
if interned, ok := si[s]; ok {
return interned
}
si[s] = s
return s
}
func main() {
si := stringInterner{}
s1 := si.Intern("12")
s2 := si.Intern(strconv.Itoa(12))
fmt.Println(stringptr(s1) == stringptr(s2)) // true
}
Let's take a look at a few examples where interning could be useful.
Lexers, parsers and other text processing tools can greatly benefit from storing only distinct string values in memory.
The following program reads George Orwell's novel 1984 from a file and tokenizes it into a slice of words for further processing:
package main
import (
"bufio"
"log"
"os"
)
func main() {
f, err := os.Open("1984.txt")
if err != nil {
log.Fatal(err)
}
defer f.Close()
scanner := bufio.NewScanner(f)
scanner.Split(bufio.ScanWords)
var words []string
for scanner.Scan() {
words = append(words, scanner.Text())
}
if err := scanner.Err(); err != nil {
log.Fatal(err)
}
log.Print(len(words))
}
The book contains 103549 words; the length of all words combined is 483016 bytes. It's important to note that the number of unique words is much smaller: 15585.
If we take a look at a part of the words slice, we can see that the same word (IS in our example) has different addresses in memory:
fmt.Println(words[1111:1120])
for _, word := range words[1111:1120] {
fmt.Printf("%x ", stringptr(word))
}
[WAR IS PEACE FREEDOM IS SLAVERY IGNORANCE IS STRENGTH]
^ ^ ^
c0000159fa c0000159fe c000015a00 c000015a05 c000015a0c c000015a10 c000015a17 c000015a20 c000015a28
^ ^ ^
Let's modify the program to use string interning:
package main
import (
"bufio"
"log"
"os"
)
type stringInterner map[string]string
func (si stringInterner) InternBytes(b []byte) string {
if interned, ok := si[string(b)]; ok {
return interned
}
s := string(b)
si[s] = s
return s
}
func main() {
f, err := os.Open("1984.txt")
if err != nil {
log.Fatal(err)
}
defer f.Close()
scanner := bufio.NewScanner(f)
scanner.Split(bufio.ScanWords)
si := stringInterner{}
var words []string
for scanner.Scan() {
words = append(words, si.InternBytes(scanner.Bytes())) // intern words
}
if err := scanner.Err(); err != nil {
log.Fatal(err)
}
log.Print(len(words))
}
Now the words slice consists of strings that point to the string intern pool. All instances of IS have the same address in memory:
fmt.Println(words[1111:1120])
for _, word := range words[1111:1120] {
fmt.Printf("%x ", stringptr(word))
}
[WAR IS PEACE FREEDOM IS SLAVERY IGNORANCE IS STRENGTH]
^ ^ ^
c000015220 c0000146c8 c000015223 c000015228 c0000146c8 c000015230 c000015237 c0000146c8 c000015240
^ ^ ^
The amount of memory required to store the words decreased from 483016 to 119628 bytes - a more than 4x reduction.
Moreover, the []byte map key optimization helps to avoid expensive heap allocations. The Go compiler recognizes si[string(b)] and performs the lookup operation without allocating a new string and producing garbage.
Another example where string interning can be useful is network services. Interning can be applied to some string responses from databases, gRPC or HTTP servers.
The following snippet caches user information after querying it from a database:
type user struct {
ID int
Country string
}
type userService struct {
db *sql.DB
cache sync.Map
}
func (us *userService) get(id int) (*user, error) {
if u, ok := us.cache.Load(id); ok {
return u.(*user), nil
}
u := &user{}
row := us.db.QueryRow("SELECT id, country FROM users WHERE id=?", id)
if err := row.Scan(&u.ID, &u.Country); err != nil {
return nil, err
}
us.cache.Store(id, u)
return u, nil
}
There are about 200 countries as of December 2018, so storing only a single copy of each string can save us memory if the number of users we want to keep in cache is large.
The user service is called from an HTTP handler, which means the intern pool must be safe for concurrent access from multiple goroutines. Luckily, Go 1.9 introduced a concurrent map - sync.Map.
The user service using a thread-safe version of stringInterner:
type stringInterner struct {
sync.Map
}
func (si *stringInterner) Intern(s string) string {
interned, _ := si.LoadOrStore(s, s)
return interned.(string)
}
type userService struct {
db *sql.DB
cache sync.Map
si stringInterner
}
func (us *userService) get(id int) (*user, error) {
if u, ok := us.cache.Load(id); ok {
return u.(*user), nil
}
u := &user{}
row := us.db.QueryRow("SELECT id, country FROM users WHERE id=?", id)
if err := row.Scan(&u.ID, &u.Country); err != nil {
return nil, err
}
u.Country = us.si.Intern(u.Country) // intern country
us.cache.Store(id, u)
return u, nil
}
Note
Keep in mind the intern pool only grows and never shrinks in the implementation above. One way to solve that is by having two maps in the pool (current and previous) and rotating them once in a while.
The Go standard library uses interning in a few places; one of them is net/textproto. textproto.Reader interns common HTTP headers (Content-Type, Host, User-Agent, etc.) to avoid memory allocations.
A decrease in memory usage is not the only advantage of string interning. Interned strings can be compared for equality in constant time. All the runtime needs to do is check whether the two pointers are equal (see the Go source code) instead of going through the characters:
TEXT cmpbody<>(SB),NOSPLIT,$0-0
CMPQ SI, DI
JEQ allsame
I created a benchmark to see the performance difference between interned and non-interned strings:
package main
import (
"strings"
"testing"
)
func benchmarkStringCompare(b *testing.B, count int) {
s1 := strings.Repeat("a", count)
s2 := strings.Repeat("a", count)
b.ResetTimer()
for n := 0; n < b.N; n++ {
if s1 != s2 {
b.Fatal()
}
}
}
func benchmarkStringCompareIntern(b *testing.B, count int) {
si := stringInterner{}
s1 := si.Intern(strings.Repeat("a", count))
s2 := si.Intern(strings.Repeat("a", count))
b.ResetTimer()
for n := 0; n < b.N; n++ {
if s1 != s2 {
b.Fatal()
}
}
}
func BenchmarkStringCompare1(b *testing.B) { benchmarkStringCompare(b, 1) }
func BenchmarkStringCompare10(b *testing.B) { benchmarkStringCompare(b, 10) }
func BenchmarkStringCompare100(b *testing.B) { benchmarkStringCompare(b, 100) }
func BenchmarkStringCompareIntern1(b *testing.B) { benchmarkStringCompareIntern(b, 1) }
func BenchmarkStringCompareIntern10(b *testing.B) { benchmarkStringCompareIntern(b, 10) }
func BenchmarkStringCompareIntern100(b *testing.B) { benchmarkStringCompareIntern(b, 100) }
The comparison speed for interned strings remains constant, independent of the number of characters:
BenchmarkStringCompare1-4 500000000 2.93 ns/op
BenchmarkStringCompare10-4 300000000 6.21 ns/op
BenchmarkStringCompare100-4 100000000 13.2 ns/op
BenchmarkStringCompareIntern1-4 1000000000 2.60 ns/op
BenchmarkStringCompareIntern10-4 1000000000 2.60 ns/op
BenchmarkStringCompareIntern100-4 1000000000 2.60 ns/op
String interning can save memory at the cost of CPU time - every lookup from the intern pool requires hashing the input string (which may not work for CPU-bound applications).
Here is a real-life example: after I deployed a new version with string interning enabled, the application's memory usage went down from 70% to 45%.
Note
This post is outdated, please read the new design document on GitHub.
A few months ago I released the first version of an embedded on-disk key-value store written in Go. The store is about 10 times faster than LevelDB for random lookups. I'll explain why it's faster, but first let's talk about the reason why I decided to create my own key-value store.
I needed a store that could map keys to values with the following requirements:
The number of keys was large and I couldn't keep the items in memory (that would have required switching to a more expensive server instance type).
Low latency. I wanted to avoid network overhead if possible.
I needed to rebuild the mapping once a day and then access it in read-only mode.
The sequential lookup performance wasn't important, all I cared about was random lookup performance.
There are plenty of open-source key-value stores - the most popular are LevelDB, RocksDB and LMDB. There are also some key-value stores implemented in Go - e.g. Bolt, goleveldb (a port of LevelDB) and BadgerDB.
I tried to use LevelDB and RocksDB in production, but unfortunately, the observed results didn't meet the performance requirements.
The stores I mentioned above are not optimized for a specific use case - they are general-purpose solutions. For example, LevelDB supports range queries (iterating over keys in lexicographic order), caching, bloom filters and many other nice features, but I didn't need any of them. The common implementation of key-value stores is based on tree algorithms - usually B+ trees or LSM trees. Both data structures always require several I/O operations per lookup.
A hash table with a constant average lookup time seemed like a better choice. At least in theory a hash table beats a tree for my use case, so I decided to implement a simple on-disk hash table and benchmark it against LevelDB.
Pogreb uses two files to store keys and values.
The index file holds the header followed by an array of buckets.
+----------+
| header |
+----------+
| bucket 0 |
| ... |
| bucket N |
+----------+
Each bucket is an array of slots plus an optional file pointer to an overflow bucket. The number of slots in a bucket is hardcoded to 28 - the maximum number of slots that fits in 512 bytes.
+------------------------+
| slot 0 |
| ... |
| slot N |
+------------------------+
| overflow bucket offset |
+------------------------+
A slot contains the hash, the size of the key, the value size and a 64-bit offset of the key/value pair in the data file.
+------------------+
| key hash |
+------------------+
| key size |
+------------------+
| value size |
+------------------+
| key/value offset |
+------------------+
The data file stores keys, values, and overflow buckets.
Pogreb uses the linear hashing algorithm, which grows the index file one bucket at a time instead of rebuilding the whole hash table.
Initially, the hash table contains a single bucket (N=1).
Level L (initially L=0) represents, on a logarithmic scale, the maximum number of buckets the hash table can have. For example, a hash table with L=0 contains between 0 and 1 buckets; with L=3 it contains between 4 and 8 buckets.
S is the index of the "split" bucket (initially S=0).
Collisions are resolved using the bucket chaining method. Overflow buckets are stored in the data file and form a linked list.
Position of the bucket in the index file is calculated by applying a hash function to the key:
Index file
+----------+
| bucket 0 | Bucket
| ... | +------------------------------+
h(key) -> | bucket X | -> | slot 0 ... slot Y ... slot N |
| ... | +------------------------------+
| bucket N | |
+----------+ |
|
v
Data file
+-----------+
| key-value |
+-----------+
To get the position of the bucket:
Hash the key (Pogreb uses the 32-bit version of MurmurHash3).
Take the L least significant bits of the hash to get the position of the bucket - hash % 2^L.
If the calculated position comes before the split bucket S, the position is instead hash % 2^(L+1).
The lookup function reads a bucket at the given position from the index file and performs a linear search to find a slot with the required hash. If the bucket doesn't contain a slot with the required hash, but the pointer to the overflow bucket is non-zero, the overflow bucket is checked. The process continues until the required slot is found or until there are no more overflow buckets for the given key. Once a slot with the required key is found, Pogreb reads the key/value pair from the data file.
The average lookup requires two I/O operations - one is to find a slot in the index file and another one is to read the key and value from the data file.
Insertion is performed by adding the key/value pair to the data file and updating a bucket in the index file. If the bucket has all of its slots occupied, a new overflow bucket is created in the data file.
When the number of items in the hash table exceeds the load factor threshold (70%), the split operation is performed on the split bucket S:
A new bucket is allocated at the end of the index file.
The split bucket index S is incremented.
If S reaches 2^L, S is reset to 0 and L is incremented.
The items from the old split bucket are separated between the newly allocated bucket and the old split bucket by recalculating the positions of the keys in the hash table.
The number of buckets N is incremented.
The removal operation looks up the bucket by key, removes the slot from the bucket, overwrites the bucket and then adds the offset of the key/value pair to the free list.
Pogreb maintains a list of blocks freed from the data file. The insertion operation tries to find a free block in the free list first before extending the data file. The free list implementation uses the "best-fit" algorithm for allocations. The free list can perform basic defragmentation by merging the adjacent free blocks.
The load factor is the ratio of the number of items in the hash table to the total number of available slots. The load factor determines the point when the hash table should grow the index file. If the load factor is too large, it can lead to a big number of overflow buckets. At the same time, a small load factor wastes a lot of space in the index file, which can also hurt hash table performance because it leads to poor caching.
For example, Python's dict uses a load factor of ~66%, Java's HashMap defaults to 75%, and Go's map uses 65%.
Here are the benchmark results for different load factor values:

load factor | reads per second | index size (MB)
---|---|---
0.4 | 909488 | 45
0.5 | 908071 | 35
0.6 | 909977 | 30
0.65 | 921098 | 27.5
0.7 | 921432 | 25
0.75 | 910038 | 23.5
0.8 | 862662 | 22
0.9 | 731267 | 19.5
0.7 is a clear winner here.
Both index and data files are stored entirely on disk. The files are mapped into the process's virtual memory space using the mmap syscall on Unix-like operating systems or using the CreateFileMapping WinAPI function on Windows.
Switching from regular file I/O to mmap for lookup operations gave a nearly 100% performance boost.
Pogreb makes an important assumption - it assumes that all 512-byte sector disk writes are atomic. Each bucket is precisely 512 bytes and all allocations in the data file are 512-byte aligned. The insertion operation doesn't overwrite the old data in the data file - it writes a new key/value pair first, then updates the bucket in-place in the index file. If both operations are successful, the space allocated by the old key/value is added to the free list.
Pogreb doesn't flush changes to disk (the fsync operation) automatically. Data that is not flushed may be lost in case of power loss. However, an application crash shouldn't cause any issues. There is an option to make Pogreb fsync changes to disk after each write - set BackgroundSyncInterval to -1. Alternatively, DB.Sync can be called at any time to force the OS to write the data to disk.
Pogreb supports multiple concurrent readers and a single writer - write operations are guarded with a mutex.
The API is similar to the LevelDB API - it provides methods like Put, Get, Has, Delete and supports iteration over the key/value pairs (the Items method):
package main
import (
"log"
"github.com/akrylysov/pogreb"
)
func main() {
// Opening a database.
db, err := pogreb.Open("example", nil)
if err != nil {
log.Fatal(err)
return
}
defer db.Close()
// Writing to a database.
err = db.Put([]byte("foo"), []byte("bar"))
if err != nil {
log.Fatal(err)
return
}
// Reading from a database.
val, err := db.Get([]byte("foo"))
if err != nil {
log.Fatal(err)
return
}
log.Printf("%s", val)
// Iterating over key/value pairs.
it := db.Items()
for {
key, val, err := it.Next()
if err == pogreb.ErrIterationDone {
break
}
if err != nil {
log.Fatal(err)
}
log.Printf("%s %s", key, val)
}
}
Unlike LevelDB's iterator, Pogreb's iterator returns key/value pairs in an unspecified order - you get constant time lookups in exchange for losing range and prefix scans.
All API methods are safe for concurrent use by multiple goroutines.
I created a tool to benchmark Pogreb, Bolt, goleveldb and BadgerDB. The tool generates a given number of random keys (16-64 byte length) and writes random values (128-512 byte length) for the generated keys. After successfully writing the items, pogreb-bench reopens the database, shuffles the keys and reads them back.
The benchmarks were performed on a DigitalOcean droplet with 8 CPUs / 16 GB RAM / 160 GB SSD running Ubuntu 16.04.3.
Choosing the right data structure and optimizing the store for the specific use case (random lookups) made Pogreb significantly faster than similar solutions.
Pogreb on Github - https://github.com/akrylysov/pogreb.
A few days ago Amazon announced official Go support for AWS Lambda.
The API Gateway integration is straightforward: all you need to do is import the github.com/aws/aws-lambda-go package, implement a Lambda handler and call lambda.Start to register the handler:
package main
import (
"github.com/aws/aws-lambda-go/events"
"github.com/aws/aws-lambda-go/lambda"
)
func lambdaHandler(event events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
return events.APIGatewayProxyResponse{Body: "hi", StatusCode: 200}, nil
}
func main() {
lambda.Start(lambdaHandler)
}
I had a small service I wanted to port to Lambda:
// Very very simplified version of the service.
package main
import (
"fmt"
"net/http"
"strconv"
)
func indexHandler(w http.ResponseWriter, r *http.Request) {
w.Write([]byte("index"))
}
func addHandler(w http.ResponseWriter, r *http.Request) {
f, _ := strconv.Atoi(r.FormValue("first"))
s, _ := strconv.Atoi(r.FormValue("second"))
w.Header().Set("X-Hi", "foo")
fmt.Fprintf(w, "%d", f+s)
}
func main() {
http.HandleFunc("/", indexHandler)
http.HandleFunc("/add", addHandler)
http.ListenAndServe(":8080", nil)
}
I didn't want to rewrite any of my HTTP handlers. When I looked at the code, the first idea was that it would be nice to make the Lambda runtime support the existing net/http handlers.
The net/http server consists of two main parts:
The HTTP request multiplexer ServeMux (Handler interface) which routes all incoming requests.
The HTTP Server itself that handles the TCP connections.
You can replace any of these parts, e.g. you can plug in one of the custom HTTP routers (chi, gorilla mux, httprouter) that provide more features and/or better performance compared to ServeMux. Similarly, you can create your own Server implementation.
This flexibility makes many things easier, e.g. unit testing:
package main
import (
"net/http"
"net/http/httptest"
"testing"
)
func TestIndexHandler(t *testing.T) {
http.HandleFunc("/", indexHandler)
r := httptest.NewRequest("GET", "/", nil)
w := httptest.NewRecorder()
http.DefaultServeMux.ServeHTTP(w, r) // Note: you can directly call indexHandler here without involving ServeMux.
if w.Code != http.StatusOK || w.Body.String() != "index" {
t.Fail()
}
}
As you may have noticed, the Go standard library provides everything to construct http.ResponseWriter and http.Request objects. In order to make the Lambda SDK support net/http handlers we need to convert APIGatewayProxyRequest into http.Request, create a mock http.ResponseWriter, call the router's ServeHTTP and return a new APIGatewayProxyResponse. A basic implementation:
func lambdaHandler(event events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
r := httptest.NewRequest(event.HTTPMethod, event.Path, nil)
w := httptest.NewRecorder()
http.DefaultServeMux.ServeHTTP(w, r)
respEvent := events.APIGatewayProxyResponse{
Body: w.Body.String(),
StatusCode: w.Code,
}
return respEvent, nil
}
The code above will work for simple handlers, but it needs a few more things to be ready for production - support for headers, query string parameters and binary responses. I created a package, algnhsa, that can be used as a drop-in replacement for the net/http server:
package main
import (
"fmt"
"net/http"
"strconv"
"github.com/akrylysov/algnhsa"
)
func indexHandler(w http.ResponseWriter, r *http.Request) {
w.Write([]byte("index"))
}
func addHandler(w http.ResponseWriter, r *http.Request) {
f, _ := strconv.Atoi(r.FormValue("first"))
s, _ := strconv.Atoi(r.FormValue("second"))
w.Header().Set("X-Hi", "foo")
fmt.Fprintf(w, "%d", f+s)
}
func main() {
http.HandleFunc("/", indexHandler)
http.HandleFunc("/add", addHandler)
algnhsa.ListenAndServe(http.DefaultServeMux, nil)
}
On the API Gateway side, define a proxy ANY method to handle requests to / and a catch-all {proxy+} resource to handle requests to every other path.
To make the API Gateway treat certain content types as binary, you need to add the desired types to your API's "Binary Media Types" (Settings section) and also pass them to the algnhsa.ListenAndServe function:
algnhsa.ListenAndServe(http.DefaultServeMux, []string{"image/jpeg", "image/png"})
You can find the algnhsa package (Lambda Go net/http server adapter) on GitHub - https://github.com/akrylysov/algnhsa.
package main
/*
#include <stdio.h>
void foo(int x) {
printf("x: %d\n", x);
}
*/
import "C"
func main() {
C.foo(C.int(123)) // x: 123
}
The identifiers declared in the embedded C code can be accessed through the "C" package - e.g. write C.foo(C.int(123)) to call the function foo. Pretty straightforward, right? This opens the door to a huge amount of code written in C - or written in any other language that provides C bindings.
Almost all libraries play by the rules, but some libraries written in C++ may throw exceptions. C doesn't support exceptions, so there is no way to catch them in C or Go/cgo. A C function should never throw an exception, but nothing stops a developer from writing code that does. A good example of such a library is the machine learning project Vowpal Wabbit - it's written in C++ and provides a C interface that may throw an exception from time to time:
package main
// #cgo LDFLAGS: -lvw_c_wrapper
// #include <stdlib.h>
// #include <vowpalwabbit/vwdll.h>
import "C"
import "unsafe"
func main() {
cArgs := C.CString("invalid args")
defer C.free(unsafe.Pointer(cArgs))
C.VW_InitializeA(cArgs) // exception
}
Unhandled C++ exception on VW_InitializeA call just crashes the program:
SIGABRT: abort
PC=0x7efc793a7428 m=0 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 1 [syscall, locked to thread]:
runtime.cgocall(0x4500b0, 0xc420049f58, 0xc420049f58)
    /usr/lib/go-1.8/src/runtime/cgocall.go:131 +0xe2 fp=0xc420049f28 sp=0xc420049ee8
main._Cfunc_VW_InitializeA(0x193c620, 0x0)
While cgo lets us call only C code, we can still link any C++ file into our program. This gives us the ability to create a C wrapper for VW_InitializeA that handles the exceptions. We can use a structure to return both the original return value and a pointer to the exception message (good old C doesn't support tuples or multiple return values):
// vw_wrapper.h
#ifdef __cplusplus
extern "C" {
#endif
#include <stdlib.h>
#include <vowpalwabbit/vwdll.h>
typedef struct VWW_HANDLE_ERR {
VW_HANDLE handle;
const char *pstrErr;
} VWW_HANDLE_ERR;
VW_DLL_MEMBER VWW_HANDLE_ERR VW_CALLING_CONV VWW_InitializeA(const char *pstrArgs);
#ifdef __cplusplus
}
#endif
The wrapper code is only a few lines, save it in the project directory as a .cpp file:
#include <string.h>
#include <exception>
#include "vw_wrapper.h"
VW_DLL_MEMBER VWW_HANDLE_ERR VW_CALLING_CONV VWW_InitializeA(const char *pstrArgs) {
VWW_HANDLE_ERR result = {0};
try {
result.handle = VW_InitializeA(pstrArgs);
} catch(std::exception &e) {
result.pstrErr = strdup(e.what());
}
return result;
}
The Go compiler will compile and link the C++ source with the program. Don't forget to explicitly free the pstrErr pointer when its value is not nil - the Go garbage collector doesn't know how to deal with C pointers:
package main
// #cgo LDFLAGS: -lvw_c_wrapper
// #include <stdlib.h>
// #include "vw_wrapper.h"
import "C"
import (
"log"
"unsafe"
)
func main() {
cArgs := C.CString("invalid args")
defer C.free(unsafe.Pointer(cArgs))
ret := C.VWW_InitializeA(cArgs)
if ret.pstrErr != nil {
defer C.free(unsafe.Pointer(ret.pstrErr))
log.Println("VW error: ", C.GoString(ret.pstrErr))
}
}
Now, whenever the original VW_InitializeA throws an exception, the C++ wrapper will catch it and return the error message as a part of the VWW_HANDLE_ERR structure.
Note
This post was updated on 2021-04-25.
Go has a powerful built-in profiler that supports CPU, memory, goroutine and block (contention) profiling.
Go provides a low-level profiling API, runtime/pprof, but if you are developing a long-running service, it's more convenient to work with the high-level net/http/pprof package.
All you need to enable the profiler is to import net/http/pprof and it will automatically register the required HTTP handlers:
package main
import (
"net/http"
_ "net/http/pprof"
)
func hiHandler(w http.ResponseWriter, r *http.Request) {
w.Write([]byte("hi"))
}
func main() {
http.HandleFunc("/", hiHandler)
http.ListenAndServe(":8080", nil)
}
If your web application is using a custom URL router, you'll need to register a few pprof HTTP endpoints manually:
package main
import (
"net/http"
"net/http/pprof"
)
func hiHandler(w http.ResponseWriter, r *http.Request) {
w.Write([]byte("hi"))
}
func main() {
r := http.NewServeMux()
r.HandleFunc("/", hiHandler)
// Register pprof handlers
r.HandleFunc("/debug/pprof/", pprof.Index)
r.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
r.HandleFunc("/debug/pprof/profile", pprof.Profile)
r.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
r.HandleFunc("/debug/pprof/trace", pprof.Trace)
http.ListenAndServe(":8080", r)
}
That's it, launch the application, and then use the pprof tool:
go tool pprof [binary] http://127.0.0.1:8080/debug/pprof/profile
One of the biggest pprof advantages is its low overhead - it can be used in a production environment on live traffic without any noticeable performance penalty.
But before digging deeper into pprof, we need a real example which can show how to identify and fix performance issues in Go.
Assume you need to develop a brand-new microservice that adds left padding to a given input string:
$ curl "http://127.0.0.1:8080/v1/leftpad/?str=test&len=10&chr=*"
{"str":"******test"}
The service needs to collect some basic metrics - the number of incoming requests and how long every request takes. All collected metrics are supposed to be sent to a metric aggregator (e.g. StatsD). In addition, the service needs to log the request details - URL, IP address and user-agent.
You can find the initial implementation on GitHub tagged as v1.
Compile and run the application:
go build && ./goprofex
We are going to measure how many requests per second the microservice is able to handle. This can be done using the Apache Benchmark tool:
ab -k -c 8 -n 100000 "http://127.0.0.1:8080/v1/leftpad/?str=test&len=50&chr=*"
# -k Enables HTTP keep-alive
# -c Number of concurrent requests
# -n Number of total requests to make
Not bad, but could be faster:
Requests per second: 22810.15 [#/sec] (mean)
Time per request: 0.042 [ms] (mean, across all concurrent requests)
Note
The benchmarking was performed on MacBook Pro Late 2013 (2.6 GHz Intel Core i5, 8 GB 1600 MHz DDR3, macOS 10.12.3) using Go 1.8.
Run the Apache benchmark tool again, but with a high number of requests (1 million should be enough) and at the same time run pprof:
go tool pprof http://127.0.0.1:8080/debug/pprof/profile
The CPU profiler runs for 30 seconds by default. It uses sampling to determine which functions spend the most CPU time. The Go runtime stops the execution every 10 milliseconds and records the current call stack of all running goroutines.
When pprof enters the interactive mode, type top; the command shows a list of the functions that appeared most often in the collected samples. In our case these are all runtime and standard library functions, which is not very useful:
(pprof) top
63.77s of 69.02s total (92.39%)
Dropped 331 nodes (cum <= 0.35s)
Showing top 10 nodes out of 78 (cum >= 0.64s)
flat flat% sum% cum cum%
50.79s 73.59% 73.59% 50.92s 73.78% syscall.Syscall
4.66s 6.75% 80.34% 4.66s 6.75% runtime.kevent
2.65s 3.84% 84.18% 2.65s 3.84% runtime.usleep
1.88s 2.72% 86.90% 1.88s 2.72% runtime.freedefer
1.31s 1.90% 88.80% 1.31s 1.90% runtime.mach_semaphore_signal
1.10s 1.59% 90.39% 1.10s 1.59% runtime.mach_semaphore_wait
0.51s 0.74% 91.13% 0.61s 0.88% log.(*Logger).formatHeader
0.49s 0.71% 91.84% 1.06s 1.54% runtime.mallocgc
0.21s 0.3% 92.15% 0.56s 0.81% runtime.concatstrings
0.17s 0.25% 92.39% 0.64s 0.93% fmt.(*pp).doPrintf
There is a much better way to get a high-level performance overview - the -http flag:
go tool pprof -http=:8081 http://127.0.0.1:8080/debug/pprof/profile
When you run pprof with the -http flag, the tool opens the profile in the web browser. The graph makes it easy to see all the hot spots.
From the graph above you can see that the application spends a big chunk of CPU on logging, metric reporting and some time on garbage collection.
Use list to inspect a function in detail, e.g. list leftpad:
(pprof) list leftpad
ROUTINE ======================== main.leftpad in /Users/artem/go/src/github.com/akrylysov/goprofex/leftpad.go
20ms 490ms (flat, cum) 0.71% of Total
. . 3:func leftpad(s string, length int, char rune) string {
. . 4: for len(s) < length {
20ms 490ms 5: s = string(char) + s
. . 6: }
. . 7: return s
. . 8:}
For those who are not afraid to look at disassembled code, pprof includes the disasm command, which shows the actual processor instructions:
(pprof) disasm leftpad
ROUTINE ======================== main.leftpad
20ms 490ms (flat, cum) 0.71% of Total
. . 1312ab0: GS MOVQ GS:0x8a0, CX
. . 1312ab9: CMPQ 0x10(CX), SP
. . 1312abd: JBE 0x1312b5e
. . 1312ac3: SUBQ $0x48, SP
. . 1312ac7: MOVQ BP, 0x40(SP)
. . 1312acc: LEAQ 0x40(SP), BP
. . 1312ad1: MOVQ 0x50(SP), AX
. . 1312ad6: MOVQ 0x58(SP), CX
...
Run the heap profiler:
go tool pprof http://127.0.0.1:8080/debug/pprof/heap
By default, it shows the amount of memory currently in use:
(pprof) top
512.17kB of 512.17kB total ( 100%)
Dropped 85 nodes (cum <= 2.56kB)
Showing top 10 nodes out of 13 (cum >= 512.17kB)
flat flat% sum% cum cum%
512.17kB 100% 100% 512.17kB 100% runtime.mapassign
0 0% 100% 512.17kB 100% main.leftpadHandler
0 0% 100% 512.17kB 100% main.timedHandler.func1
0 0% 100% 512.17kB 100% net/http.(*Request).FormValue
0 0% 100% 512.17kB 100% net/http.(*Request).ParseForm
0 0% 100% 512.17kB 100% net/http.(*Request).ParseMultipartForm
0 0% 100% 512.17kB 100% net/http.(*ServeMux).ServeHTTP
0 0% 100% 512.17kB 100% net/http.(*conn).serve
0 0% 100% 512.17kB 100% net/http.HandlerFunc.ServeHTTP
0 0% 100% 512.17kB 100% net/http.serverHandler.ServeHTTP
But we are more interested in the number of allocated objects. Call pprof with the -alloc_objects option:
go tool pprof -alloc_objects http://127.0.0.1:8080/debug/pprof/heap
Almost 70% of all objects were allocated by only two functions - leftpad and StatsD.Send - so we'll need to look at them closer:
(pprof) top
559346486 of 633887751 total (88.24%)
Dropped 32 nodes (cum <= 3169438)
Showing top 10 nodes out of 46 (cum >= 14866706)
flat flat% sum% cum cum%
218124937 34.41% 34.41% 218124937 34.41% main.leftpad
116692715 18.41% 52.82% 218702222 34.50% main.(*StatsD).Send
52326692 8.25% 61.07% 57278218 9.04% fmt.Sprintf
39437390 6.22% 67.30% 39437390 6.22% strconv.FormatFloat
30689052 4.84% 72.14% 30689052 4.84% strings.NewReplacer
29869965 4.71% 76.85% 29968270 4.73% net/textproto.(*Reader).ReadMIMEHeader
20441700 3.22% 80.07% 20441700 3.22% net/url.parseQuery
19071266 3.01% 83.08% 374683692 59.11% main.leftpadHandler
17826063 2.81% 85.90% 558753994 88.15% main.timedHandler.func1
14866706 2.35% 88.24% 14866706 2.35% net/http.Header.clone
Other useful options for debugging memory issues are -inuse_objects, which displays the count of in-use objects, and -alloc_space, which shows how much memory has been allocated since the program started.
Automatic memory management is convenient, but nothing in this world is free. Dynamic allocations are not only significantly slower than stack allocations, they also affect performance indirectly: every piece of memory you allocate on the heap adds work for the GC and makes it use more CPU resources. The only way to make the application spend less time on garbage collection is to reduce the number of allocations.
Using the & operator to get a pointer to a variable, or allocating a new value with make or new, doesn't necessarily mean the value is allocated on the heap.
func foo(a []string) {
fmt.Println(len(a))
}
func main() {
foo(make([]string, 8))
}
In the example above, make([]string, 8) is allocated on the stack. Go uses escape analysis to determine whether it's safe to allocate memory on the stack instead of the heap. You can add the -gcflags=-m option to see the results of the escape analysis:
5 type X struct {v int}
6
7 func foo(x *X) {
8 fmt.Println(x.v)
9 }
10
11 func main() {
12 x := &X{1}
13 foo(x)
14 }
go build -gcflags=-m
./main.go:7: foo x does not escape
./main.go:12: main &X literal does not escape
The Go compiler is smart enough to turn some dynamic allocations into stack allocations. Things get worse, for example, when you start dealing with interfaces:
// Example 1
type Fooer interface {
foo(a []string)
}
type FooerX struct{}
func (FooerX) foo(a []string) {
fmt.Println(len(a))
}
func main() {
a := make([]string, 8) // make([]string, 8) escapes to heap
var fooer Fooer
fooer = FooerX{}
fooer.foo(a)
}
// Example 2
func foo(a interface{}) string {
return a.(fmt.Stringer).String()
}
func main() {
foo(make([]string, 8)) // make([]string, 8) escapes to heap
}
Go Escape Analysis Flaws paper by Dmitry Vyukov describes more cases that escape analysis is unable to handle.
Generally speaking, you should prefer values over pointers for small structures that you don't need to change.
Note
For big structures, it might be cheaper to pass a pointer than to copy the whole structure and pass it by value.
Goroutine profile dumps the goroutine call stack and the number of running goroutines:
go tool pprof http://127.0.0.1:8080/debug/pprof/goroutine
There are only 18 active goroutines, which is very low. It's not uncommon to have thousands of running goroutines without significant performance degradation.
The blocking profile shows function calls that led to blocking on synchronization primitives like mutexes and channels.
Before running the block contention profile, you have to set the profiling rate using runtime.SetBlockProfileRate. You can add the call to your main or init function.
go tool pprof http://127.0.0.1:8080/debug/pprof/block
timedHandler and leftpadHandler spend a lot of time waiting on a mutex inside log.Printf. This happens because the log package uses a mutex to synchronize access to a file shared across multiple goroutines.
As we noticed before, the biggest offenders in terms of performance are the log package and the leftpad and StatsD.Send functions. Now that we've found the bottlenecks, we need a reproducible way to measure the performance of the code we are interested in before starting to optimize it. The Go testing package includes such a mechanism: you create a function of the form func BenchmarkXxx(*testing.B) in a test file:
func BenchmarkStatsD(b *testing.B) {
	statsd := StatsD{
		Namespace:  "namespace",
		SampleRate: 0.5,
	}
	for i := 0; i < b.N; i++ {
		statsd.Incr("test")
	}
}
It's also possible to benchmark a whole HTTP handler using the net/http/httptest package:
func BenchmarkLeftpadHandler(b *testing.B) {
	r := httptest.NewRequest("GET", "/v1/leftpad/?str=test&len=50&chr=*", nil)
	for i := 0; i < b.N; i++ {
		w := httptest.NewRecorder()
		leftpadHandler(w, r)
	}
}
Run the benchmarks:
go test -bench=. -benchmem
It shows the amount of time each iteration takes and the amount of memory/number of allocations:
BenchmarkTimedHandler-4 200000 6511 ns/op 1621 B/op 41 allocs/op
BenchmarkLeftpadHandler-4 200000 10546 ns/op 3297 B/op 75 allocs/op
BenchmarkLeftpad10-4 5000000 339 ns/op 64 B/op 6 allocs/op
BenchmarkLeftpad50-4 500000 3079 ns/op 1568 B/op 46 allocs/op
BenchmarkStatsD-4 1000000 1516 ns/op 560 B/op 15 allocs/op
A good but not always obvious way to make an application faster is to make it do less work. Other than for debugging, the line log.Printf("%s request took %v", name, elapsed) doesn't need to be in the web service at all. Unnecessary logging should be removed or disabled before deploying to production. This problem can be solved with a leveled logger - there are plenty of logging libraries to choose from.
Another important thing about logging (and about all I/O in general) is to use buffered input/output where possible, which helps reduce the number of system calls. Usually there's no need to write to a file on every logger call - use the bufio package for buffered I/O. We can simply wrap the io.Writer we pass to the logger with bufio.NewWriter or bufio.NewWriterSize:
log.SetOutput(bufio.NewWriterSize(f, 1024*16))
Take a look at the leftpad function again:
func leftpad(s string, length int, char rune) string {
	for len(s) < length {
		s = string(char) + s
	}
	return s
}
Concatenating strings in a loop is not the smartest thing to do: every iteration allocates a new string. A better way to build a string is to use bytes.Buffer:
func leftpad(s string, length int, char rune) string {
	buf := bytes.Buffer{}
	for i := 0; i < length-len(s); i++ {
		buf.WriteRune(char)
	}
	buf.WriteString(s)
	return buf.String()
}
Alternatively, we can use strings.Repeat, which makes the code slightly shorter:
func leftpad(s string, length int, char rune) string {
	if len(s) < length {
		return strings.Repeat(string(char), length-len(s)) + s
	}
	return s
}
The next piece of code we need to change is the StatsD.Send function:
func (s *StatsD) Send(stat string, kind string, delta float64) {
	buf := fmt.Sprintf("%s.", s.Namespace)
	trimmedStat := strings.NewReplacer(":", "_", "|", "_", "@", "_").Replace(stat)
	buf += fmt.Sprintf("%s:%v|%s", trimmedStat, delta, kind)
	if s.SampleRate != 0 && s.SampleRate < 1 {
		buf += fmt.Sprintf("|@%s", strconv.FormatFloat(s.SampleRate, 'f', -1, 64))
	}
	ioutil.Discard.Write([]byte(buf)) // TODO: Write to a socket
}
Here are some possible improvements:
Sprintf is convenient for string formatting, and it's perfectly fine unless you call it thousands of times per second. It spends CPU time parsing the format string, and it allocates a new string on every call. We can replace it with bytes.Buffer plus Buffer.WriteString/Buffer.WriteByte.
The function doesn't need to create a new Replacer on every call; it can be declared as a global variable or as part of the StatsD structure.
Replace strconv.FormatFloat with strconv.AppendFloat, passing it a buffer allocated on the stack to avoid additional heap allocations. Here is the updated implementation:
func (s *StatsD) Send(stat string, kind string, delta float64) {
	buf := bytes.Buffer{}
	buf.WriteString(s.Namespace)
	buf.WriteByte('.')
	buf.WriteString(reservedReplacer.Replace(stat))
	buf.WriteByte(':')
	buf.Write(strconv.AppendFloat(make([]byte, 0, 24), delta, 'f', -1, 64))
	buf.WriteByte('|')
	buf.WriteString(kind)
	if s.SampleRate != 0 && s.SampleRate < 1 {
		buf.WriteString("|@")
		buf.Write(strconv.AppendFloat(make([]byte, 0, 24), s.SampleRate, 'f', -1, 64))
	}
	buf.WriteTo(ioutil.Discard) // TODO: Write to a socket
}
That reduces the number of allocations from 15 to 1 and makes Send run about 4x faster:
BenchmarkStatsD-4 5000000 381 ns/op 112 B/op 1 allocs/op
The benchmarks show a very nice performance boost after all optimizations:
benchmark old ns/op new ns/op delta
BenchmarkTimedHandler-4 6511 1181 -81.86%
BenchmarkLeftpadHandler-4 10546 3337 -68.36%
BenchmarkLeftpad10-4 339 136 -59.88%
BenchmarkLeftpad50-4 3079 201 -93.47%
BenchmarkStatsD-4 1516 381 -74.87%
benchmark old allocs new allocs delta
BenchmarkTimedHandler-4 41 5 -87.80%
BenchmarkLeftpadHandler-4 75 18 -76.00%
BenchmarkLeftpad10-4 6 3 -50.00%
BenchmarkLeftpad50-4 46 3 -93.48%
BenchmarkStatsD-4 15 1 -93.33%
benchmark old bytes new bytes delta
BenchmarkTimedHandler-4 1621 448 -72.36%
BenchmarkLeftpadHandler-4 3297 1416 -57.05%
BenchmarkLeftpad10-4 64 24 -62.50%
BenchmarkLeftpad50-4 1568 160 -89.80%
BenchmarkStatsD-4 560 112 -80.00%
Note
I used benchstat to compare the results.
Run ab again:
Requests per second: 32619.54 [#/sec] (mean)
Time per request: 0.030 [ms] (mean, across all concurrent requests)
The web service can handle about 10000 additional requests per second now!
Avoid unnecessary heap allocations.
Prefer values over pointers for small structures.
Preallocate maps and slices if you know the size beforehand.
Don't log if you don't have to.
Use buffered I/O if you do many sequential reads or writes.
If your application makes heavy use of JSON, consider a parser/serializer generator (I personally prefer easyjson).
Every operation matters in a hot path.
Sometimes the bottleneck is not what you expect; profiling is the best, and sometimes the only, way to understand the real performance of your application.
You can find the full source code on GitHub. The initial version is tagged as v1 and the optimized version is tagged as v2. Here is the link to compare these two versions.
You can find the source code of PhantomJS/Node.js web scraper for AWS Lambda at https://github.com/akrylysov/lambda-phantom-scraper.
Note
This post was updated on 2016-08-13: added python-rapidjson; updated simplejson and ujson.
A couple of weeks ago, after spending some time with the Python profiler, I discovered that Python's json module is not as fast as I expected, so I decided to benchmark alternative JSON libraries:
simplejson 3.8.2
ujson 1.35
python-rapidjson 0.0.6
python-cjson, yajl-py and jsonlib are not included in the benchmark: they are no longer actively developed and don't support Python 3.
simplejson and ujson can be used as drop-in replacements for the standard json module, though ujson doesn't support advanced features like hooks and custom encoders and decoders.
You can change your imports this way to use an alternative library:
import ujson as json
The benchmarks were run under three interpreters:
Python (CPython) 2.7.12
Python (CPython) 3.5.2
PyPy 5.3.0
Note
ujson is not compatible with PyPy; python-rapidjson is compatible only with Python 3.
The tests were performed on MacBook Pro Late 2013 (2.6 GHz Intel Core i5, 8 GB 1600 MHz DDR3, Mac OS 10.11.6). Every test runs 100 times.
| File name | File size | Description |
| --- | --- | --- |
| twitter.json | 632 KB | Single large JSON (source) |
| one-json-per-line.jsons.txt | 176 KB | Collection of 1000 JSON objects (source) |
I published the source code of the benchmark on GitHub. You can clone it and rerun if you want to check it by yourself or if a new version of an alternative JSON library is released.
Python (CPython) 2.7.12:

|  | json | simplejson | ujson |
| --- | --- | --- | --- |
| loads (large obj) | 1.140 | 0.441 | 0.448 |
| dumps (large obj) | 0.564 | 0.630 | 0.459 |
| loads (small objs) | 1.190 | 0.579 | 0.195 |
| dumps (small objs) | 0.910 | 1.641 | 0.304 |
Python (CPython) 3.5.2:

|  | json | simplejson | ujson | rapidjson |
| --- | --- | --- | --- | --- |
| loads (large obj) | 0.600 | 0.698 | 0.605 | 0.634 |
| dumps (large obj) | 0.673 | 0.629 | 0.381 | 0.365 |
| loads (small objs) | 0.801 | 1.091 | 0.322 | 0.531 |
| dumps (small objs) | 1.213 | 2.038 | 0.285 | 0.234 |
PyPy 5.3.0:

|  | json | simplejson |
| --- | --- | --- |
| loads (large obj) | 0.545 | 1.876 |
| dumps (large obj) | 0.632 | 3.974 |
| loads (small objs) | 0.271 | 1.651 |
| dumps (small objs) | 0.719 | 2.404 |
Note
Results are in seconds.
The numbers speak for themselves. If your application deals with large amounts of JSON data and doesn't use any advanced features of the built-in json module, you should probably consider switching to ujson.
I decided to create a test application to measure the performance of regex_search:
#include <regex>
#include <iostream>
#include <ctime>

int find_position_re(std::string &text, std::regex &regex)
{
    std::smatch match;
    if (std::regex_search(text, match, regex))
    {
        return match.position() + match.length();
    }
    return -1;
}

int main(int argc, char* argv[])
{
    std::string text(1024 * 10000, ' ');
    text.append("<body>asdasd</body>");
    std::regex regex("<body[^>]*>");
    float start = (float)clock() / CLOCKS_PER_SEC;
    find_position_re(text, regex);
    float end = (float)clock() / CLOCKS_PER_SEC;
    std::cout << end - start << std::endl;
    return 0;
}
Searching a 10-megabyte string took almost 200 milliseconds, which is not very fast considering that the test ran on a recent MacBook Pro.
Out of curiosity, I compiled the code with GCC and Clang, and wrote equivalent code in Python:
import re
import timeit

def find_position_re(text, regex):
    match = regex.search(text)
    if match:
        return match.start()
    return -1

text = ' ' * (1024 * 10000)
text += '<body>asdasd</body>'
regex = re.compile("<body[^>]*>")
print(timeit.timeit(lambda: find_position_re(text, regex), number=1))
And in JavaScript:
function find_position_re(text, regex) {
    return text.search(regex);
}

var text = '';
for (var i = 0; i < 1024 * 10000; i++) {
    text += ' ';
}
text += '<body>asdasd</body>';

var regex = new RegExp('<body[^>]*>');
console.time('test');
find_position_re(text, regex);
console.timeEnd('test');
The Node.js version of the test application turned out to be 3 times faster, and the Python version 31 times faster, than the one built with Visual Studio 2013.
Clang's result completely disappointed me. After fiddling with compiler flags for a while, I realized they weren't the problem at all. I ran a profiler to find the bottleneck. It turned out to be the __match_at_start_ecma function, which was called on every iteration of the search:
template <class _CharT, class _Traits>
template <class _Allocator>
bool
basic_regex<_CharT, _Traits>::__match_at_start_ecma(
    const _CharT* __first, const _CharT* __last,
    match_results<const _CharT*, _Allocator>& __m,
    regex_constants::match_flag_type __flags, bool __at_first) const
{
    vector<__state> __states;
    __node* __st = __start_.get();
    if (__st)
    {
        __states.push_back(__state());
        __states.back().__do_ = 0;
        __states.back().__first_ = __first;
        __states.back().__current_ = __first;
        __states.back().__last_ = __last;
        __states.back().__sub_matches_.resize(mark_count());
        __states.back().__loop_data_.resize(__loop_count());
        __states.back().__node_ = __st;
        __states.back().__flags_ = __flags;
        __states.back().__at_first_ = __at_first;
        do
        {
            __state& __s = __states.back();
            if (__s.__node_)
                __s.__node_->__exec(__s);
            switch (__s.__do_)
            {
            case __state::__end_state:
                __m.__matches_[0].first = __first;
                __m.__matches_[0].second = _VSTD::next(__first, __s.__current_ - __first);
                __m.__matches_[0].matched = true;
                for (unsigned __i = 0; __i < __s.__sub_matches_.size(); ++__i)
                    __m.__matches_[__i+1] = __s.__sub_matches_[__i];
                return true;
            case __state::__accept_and_consume:
            case __state::__repeat:
            case __state::__accept_but_not_consume:
                break;
            case __state::__split:
                {
                __state __snext = __s;
                __s.__node_->__exec_split(true, __s);
                __snext.__node_->__exec_split(false, __snext);
                __states.push_back(_VSTD::move(__snext));
                }
                break;
            case __state::__reject:
                __states.pop_back();
                break;
            default:
#ifndef _LIBCPP_NO_EXCEPTIONS
                throw regex_error(regex_constants::__re_err_unknown);
#endif
                break;
            }
        } while (!__states.empty());
    }
    return false;
}
With a string a thousand characters long, this function was called a thousand times on my test data, which meant a thousand constructions, push_backs, pop_backs, and destructions of std::vector, and consequently a thousand dynamic memory allocations and deallocations.
In my particular case, for my specific regular expression, the problem can be worked around with this hack:
int find_position_re(std::string &text, std::regex &regex)
{
    std::string text2 = text.substr(text.find("<body"));
    std::smatch match;
    if (std::regex_search(text2, match, regex))
    {
        return match.position() + match.length();
    }
    return -1;
}
This speeds up the search roughly 100x. Interestingly, the same trick applied in Python and Node.js affects performance negatively.
Regular expression support was added to the STL in C++11 and is implemented only in recent compiler versions, so one can only hope that the performance problem will be fixed in future releases.
What's the takeaway from all this? Performance comes first and foremost not from low-level (or fashionable) programming languages and fast processors, but from efficient algorithms.
The problem of developing and building extensions for all popular browsers can in most cases be solved with, for example, the Kango framework (for those unfamiliar with it, Kango lets you build extensions for Chrome, Firefox, Safari and Internet Explorer from shared JavaScript code).
There is very little information about the best way to set up a development environment for browser extensions, so I want to share my experience.
The extension development process usually looks like this:
Edit the source code.
Switch to the terminal and rebuild the extension.
Switch to the browser and reinstall/reload the extension.
The process can be cut down to a single step: rebuild and reinstall the extension in the browser with one click, without leaving the IDE.
I achieved the best workflow with Firefox.
Firefox has no built-in mechanism for installing extensions automatically, but this is easily solved with the Extension Auto-Installer add-on by Wladimir Palant.
To install an extension with Extension Auto-Installer, POST the XPI file over HTTP to http://localhost:8888/ (Extension Auto-Installer runs this local web server). You can do this, for example, with Wget (the response should be 500 No Content):
wget --post-file=extension.xpi http://localhost:8888/
A complete shell script to build and reload the extension on Mac OS and Linux:
#!/bin/sh
KANGODIR="../../framework/"
python $KANGODIR/kango.py build ./
wget --post-file=output/extension_name_1.0.0.xpi http://localhost:8888/
On Windows:
SET KANGODIR=..\..\framework\
call "%KANGODIR%\kango.py" build .\
"C:\Program Files (x86)\GnuWin32\bin\wget" --post-file=output/extension_name_1.0.0.xpi http://localhost:8888/
Chrome doesn't allow installing extensions automatically; you can only reload an extension that is already installed.
Extensions Reloader reloads all extensions installed in "development" mode whenever a tab with the address http://reload.extensions is opened.
Things are not as smooth with Chrome as with Firefox, namely:
Chrome provides no way to fully reload an extension, so if the extension's metadata (e.g. name, description, etc.) has changed, you have to open chrome://extensions/ and click Reload for the extension manually.
The browser grabs focus, since the extension reload is triggered by opening a browser tab.
If the background script inspector was open, it will be closed.
Shell script for Mac OS:
#!/bin/sh
KANGODIR="../../framework/"
python $KANGODIR/kango.py build ./
open /Applications/Google\ Chrome.app http://reload.extensions
On Windows:
SET KANGODIR=..\..\framework\
call "%KANGODIR%\kango.py" build .\
start "" "http://reload.extensions"
Almost any IDE can run a shell script on a given action, for example, on a hotkey or from the GUI.
How to set this up in PyCharm or WebStorm:
Select Run > Edit Configurations... from the menu.
In the dialog that opens, click + and choose Python (for PyCharm) or Node.js (for WebStorm) from the drop-down menu.
In the Before launch area, click + > Run External tool.
In the dialog that opens, click + again.
Enter sh in the Program field.
Enter the name of the build script, e.g. build.sh, in the Parameters field.
Done: now the Run command rebuilds the extension and reloads it in the browser without leaving the IDE.
As an experiment, I decided to start a blog.
I'd already had a negative experience with WordPress (and I don't like PHP in general), so after some searching I picked Tinkerer as the engine.
Tinkerer is not a blog engine in the usual sense but a blog generator written in Python: on your local machine it takes RST files as input and produces static HTML files ready to be uploaded to a server.
On the server side, all you need is a web server that can serve static files.
You can host it on GitHub/BitBucket.
You can keep your posts in your own version control system.
Posts are written in reStructuredText in any text editor. There is also a visual online editor.
It is built on top of Sphinx (a great project documentation tool) and supports all of its features (e.g. syntax highlighting).
Installing Tinkerer and generating a blog comes down to just four commands:
easy_install -U Tinkerer
tinker --setup
tinker --post "Post 123"
tinker --build
The compiled blog files will be in the blog/html directory.
A detailed description of the reStructuredText format, with examples, is on the Sphinx site: http://sphinx-doc.org/rest.html.
Tinkerer on BitBucket: https://bitbucket.org/vladris/tinkerer.
More detailed Tinkerer documentation: http://www.tinkerer.me/pages/documentation.html.