Linux 3.0 kernel, what changed after all?

What were the most radical changes in Kernel 3?

In addition to a new version numbering scheme, Linux 3.0 also has several new features: Btrfs data scrubbing and automatic defragmentation, Xen dom0 support, unprivileged ICMP_ECHO, wake on WLAN, Berkeley Packet Filter JIT filtering, a memcached-like system for the page cache, a sendmmsg() syscall that batches sendmsg() calls, and setns(), a syscall that allows better handling of lightweight virtualization systems such as containers. New hardware support has been added: for example, Microsoft Kinect, AMD Llano Fusion APUs, Intel iwlwifi 105 and 135, the Intel C600 Serial-Attached SCSI controller, Ralink RT5370 USB, various Realtek RTL81xx devices and the Apple iSight webcam. Many other drivers and minor improvements have been added.

Contents

1. Prominent features
1.1. Btrfs: automatic defragmentation, scrubbing, performance improvements
1.2. sendmmsg(): batching of sendmsg() calls
1.3. Xen dom0 support
1.4. Cleancache
1.5. Berkeley Packet Filter just-in-time filtering
1.6. Wake on WLAN support
1.7. Unprivileged ICMP_ECHO messages
1.8. setns() syscall: better namespace handling
1.9. Alarm-timers
2. Driver and architecture-specific changes
3. VFS
4. Process scheduler
5. Memory management
6. Networking
7. File systems
8. Crypto
9. Virtualization
10. Security
11. Tracing / profiling
12. Various core changes

1. Prominent Features

1.1. Btrfs: Automatic Defragmentation, Scrubbing, Performance Improvements

Automatic Defragmentation

COW (copy-on-write) filesystems have many advantages, but also some disadvantages, for example fragmentation. Btrfs lays out the data sequentially when files are first written to disk, but a COW design implies that any subsequent modification must not be written over the old data. Instead it is placed in a free block, which causes fragmentation (RPM databases are a common case of this problem). Additionally, Btrfs suffers from the fragmentation problems common to all file systems.
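To make the fragmentation effect concrete, here is a toy model, not Btrfs code. A file is first written into consecutive blocks, then a few blocks are rewritten copy-on-write style, each rewrite landing in the next free block. The helper names `count_extents` and `cow_overwrite` and the block numbers are invented for this illustration.

```python
def count_extents(block_map):
    """Count runs of physically consecutive blocks in a logical->physical map."""
    if not block_map:
        return 0
    extents = 1
    for prev, cur in zip(block_map, block_map[1:]):
        if cur != prev + 1:
            extents += 1
    return extents


def cow_overwrite(block_map, logical_index, next_free):
    """Rewrite one logical block copy-on-write style.

    The old physical block is never overwritten; the new data goes to
    `next_free`. Returns the next free physical block number.
    """
    block_map[logical_index] = next_free
    return next_free + 1


# A 16-block file written sequentially forms a single extent.
layout = list(range(16))
free = 100  # first free physical block (arbitrary value for the model)
before = count_extents(layout)

# Three small modifications in the middle of the file.
for idx in (3, 8, 12):
    free = cow_overwrite(layout, idx, free)
after = count_extents(layout)

print(before, after)  # 1 7
```

In this model, three one-block rewrites turn one extent into seven, which is the pattern that small random writes produce on a COW file system.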

Btrfs already offers alternatives to combat this problem. First, it supports online defragmentation using the command "btrfs filesystem defragment". Second, it has a mount option, -o nodatacow, which disables COW for data. Now Btrfs adds a third option, the -o autodefrag mount option. This mechanism detects small random writes to files and queues them for an automatic defragmentation process, so that the file system defragments itself as it is used. It is not yet suitable for virtualization or large database workloads, but works well for smaller files, such as rpm, SQLite or bdb databases. Code: (commit)

Scrub

Scrubbing is the process of checking the integrity of the data in the file system. This initial implementation of scrubbing checks the checksums of all extents in the file system. If an error occurs (checksum or IO error), a good copy is searched for. If one is found, the bad copy is rewritten. Code: (commit 1, 2)
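The verify-and-repair loop can be sketched in a few lines. This is a simplified model under stated assumptions, not the kernel implementation: blocks are byte strings held on in-memory "mirrors", `zlib.crc32` stands in for the file system's checksum (Btrfs itself uses crc32c), and the function name `scrub` is just a label for the example.

```python
import zlib


def scrub(mirrors, checksums):
    """Verify every block on every mirror and repair bad copies.

    mirrors:   list of lists of bytes; mirrors[m][i] is block i on mirror m.
    checksums: expected checksum for each block index.
    Returns (repaired, unrecoverable), two lists of (mirror, block) pairs.
    """
    repaired, unrecoverable = [], []
    for i, expected in enumerate(checksums):
        # Look for any copy of this block whose checksum still matches.
        good = next((m[i] for m in mirrors if zlib.crc32(m[i]) == expected), None)
        for m_idx, mirror in enumerate(mirrors):
            if zlib.crc32(mirror[i]) == expected:
                continue
            if good is None:
                unrecoverable.append((m_idx, i))
            else:
                mirror[i] = good  # rewrite the bad copy from the good one
                repaired.append((m_idx, i))
    return repaired, unrecoverable


blocks = [b"alpha", b"beta", b"gamma"]
sums = [zlib.crc32(b) for b in blocks]
mirror_a = list(blocks)
mirror_b = list(blocks)
mirror_b[1] = b"bxta"  # simulate silent corruption on one mirror
result = scrub([mirror_a, mirror_b], sums)
print(result)  # ([(1, 1)], [])
```

A block is only repairable when some other copy still matches its checksum, which is why scrubbing is most useful on redundant profiles.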

Other enhancements

- File creation/deletion speedup: The performance of file creation and deletion in Btrfs was very poor. The reason is that for each creation or deletion, Btrfs must do a lot of b+ tree insertions, such as the inode item, directory name item, directory name index, and so on. Now Btrfs can delay some b+ tree insertions or deletions, which allows these modifications to be batched. Microbenchmarks of file creation have been sped up by ~15%, and file deletion by ~20%. Code: (commit)

- Do not flush checksum items of unchanged file data: this speeds up fsync. A sysbench workload doing "random write + fsync" went from 112.75 requests/sec to 1,216 requests/sec. Code: (commit)

- Quasi-round-robin for space allocation in multi-device configurations: The chunk allocator currently always allocates chunks on the devices in the same order. This leads to a very uneven distribution, especially with RAID1 or RAID10 and an odd number of devices. Now Btrfs always sorts the devices before allocating, and allocates the stripes on the devices with the most available space. Code: (commit)
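The quasi-round-robin allocation described in the last item can be modeled as "sort by free space, take the top N". The sketch below is only an illustration: device names, sizes and the function name `allocate_chunk` are invented, and the real allocator considers more than raw free space.

```python
def allocate_chunk(free_space, num_stripes, stripe_size):
    """Pick devices for one chunk, preferring those with the most free space.

    free_space: dict mapping device name -> free units (mutated in place).
    Returns the list of chosen devices, or None if too few devices have room.
    """
    candidates = sorted(
        (dev for dev, free in free_space.items() if free >= stripe_size),
        key=lambda dev: free_space[dev],
        reverse=True,
    )
    if len(candidates) < num_stripes:
        return None
    chosen = candidates[:num_stripes]
    for dev in chosen:
        free_space[dev] -= stripe_size
    return chosen


# RAID1-like layout (two stripes per chunk) on three equally sized devices.
devices = {"sda": 30, "sdb": 30, "sdc": 30}
usage = {dev: 0 for dev in devices}
for _ in range(9):
    for dev in allocate_chunk(devices, num_stripes=2, stripe_size=5):
        usage[dev] += 1

print(usage)  # {'sda': 6, 'sdb': 6, 'sdc': 6}
```

With a fixed device order, the same workload would fill sda and sdb after six chunks and leave sdc's space unusable for two-stripe chunks. Sorting by free space spreads all nine chunks evenly.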

1.2. sendmmsg(): batching of sendmsg() calls

recvmsg() and sendmsg() are the syscalls used to receive data from and send data to the network. In 2.6.33, Linux added recvmmsg(), a syscall that receives in a single call data that would otherwise need multiple recvmsg() calls, improving throughput and latency in a number of scenarios. Now an equivalent sendmmsg() syscall has been added. A microbenchmark saw a 20% improvement in throughput on UDP send and 30% on raw socket send.

Code: (commit)
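As a rough illustration of the batching, the sketch below calls the C library's sendmmsg() wrapper through ctypes to send several datagrams with one syscall. It assumes a 64-bit Linux system whose libc exports sendmmsg(). The structure definitions mirror the C `struct iovec`, `struct msghdr` and `struct mmsghdr`, and `send_batch` is a hypothetical helper name.

```python
import ctypes
import os
import socket


class iovec(ctypes.Structure):
    _fields_ = [("iov_base", ctypes.c_void_p), ("iov_len", ctypes.c_size_t)]


class msghdr(ctypes.Structure):
    _fields_ = [
        ("msg_name", ctypes.c_void_p),
        ("msg_namelen", ctypes.c_uint),
        ("msg_iov", ctypes.POINTER(iovec)),
        ("msg_iovlen", ctypes.c_size_t),
        ("msg_control", ctypes.c_void_p),
        ("msg_controllen", ctypes.c_size_t),
        ("msg_flags", ctypes.c_int),
    ]


class mmsghdr(ctypes.Structure):
    _fields_ = [("msg_hdr", msghdr), ("msg_len", ctypes.c_uint)]


def send_batch(sock, payloads):
    """Send every payload as its own datagram using a single sendmmsg() call.

    `sock` must already be connected, since no destination address is set.
    Returns the number of messages the kernel accepted.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    count = len(payloads)
    buffers = [ctypes.create_string_buffer(p) for p in payloads]  # kept alive here
    iovecs = (iovec * count)()
    msgs = (mmsghdr * count)()
    for i, buf in enumerate(buffers):
        iovecs[i].iov_base = ctypes.addressof(buf)
        iovecs[i].iov_len = len(payloads[i])
        msgs[i].msg_hdr.msg_iov = ctypes.pointer(iovecs[i])
        msgs[i].msg_hdr.msg_iovlen = 1
    sent = libc.sendmmsg(sock.fileno(), msgs, count, 0)
    if sent < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return sent


if __name__ == "__main__":
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    print(send_batch(left, [b"one", b"two", b"three"]))
    print([right.recv(16) for _ in range(3)])
```

The point of the batch is that the user/kernel transition is paid once for the whole array of messages rather than once per datagram.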

1.3. Xen dom0 support

Finally, Linux has Xen dom0 support.

1.4. Cleancache

Recommended LWN article: Cleancache and Frontswap

Cleancache is an optional feature that can potentially increase page cache performance. It could be described as a memcached-like system, but for page cache pages. It offers memory storage that is not directly accessible or addressable by the kernel, and it does not guarantee that the data will not disappear. It can be used by virtualization software to improve memory handling for guests, but it can also be useful for implementing things like a compressed cache.

Code: (commit), (commit)
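The "data may vanish" contract is the key point, and a toy model makes it visible. The class below is not the kernel's cleancache interface. `EphemeralPageStore` and `read_page` are invented names, and the eviction policy (drop the oldest entry) is an assumption chosen only to keep the example short.

```python
from collections import OrderedDict


class EphemeralPageStore:
    """Toy stand-in for a cleancache backend: a bounded store that may forget."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._pages = OrderedDict()

    def put(self, key, page):
        """Offer a clean page to the store; older pages may be dropped silently."""
        self._pages[key] = page
        self._pages.move_to_end(key)
        while len(self._pages) > self.capacity:
            self._pages.popitem(last=False)

    def get(self, key):
        """Return the page if the store still has it, else None."""
        return self._pages.get(key)


def read_page(key, store, read_from_disk):
    """Try the ephemeral store first and fall back to the real storage on a miss."""
    page = store.get(key)
    if page is None:
        page = read_from_disk(key)
    return page


store = EphemeralPageStore(capacity=2)
for index, data in enumerate([b"A", b"B", b"C"]):
    store.put(("file", index), data)

print(store.get(("file", 0)))  # None: the oldest page has vanished
print(read_page(("file", 0), store, lambda key: b"from disk"))
```

Because only clean pages are stored, a miss costs a disk read but never loses data, which is what lets the backend discard pages at will.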

1.5. Berkeley Packet Filter just-in-time filtering

Recommended LWN article: A JIT for packet filters

The Berkeley Packet Filter filtering capabilities, used by tools like libpcap/tcpdump, are normally handled by an interpreter. This release adds a simple JIT that generates native code when the filter is loaded into memory (something already done by other operating systems, such as FreeBSD). The administrator needs to enable this feature by writing "1" to /proc/sys/net/core/bpf_jit_enable.

Code: (commit)
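A small helper for inspecting and flipping that switch might look like the following. The path comes from the text above. The function names are made up for the example, and writing to the real file requires root.

```python
from pathlib import Path

BPF_JIT_ENABLE = Path("/proc/sys/net/core/bpf_jit_enable")


def bpf_jit_state(path=BPF_JIT_ENABLE):
    """Return the current setting as an int, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def enable_bpf_jit(path=BPF_JIT_ENABLE):
    """Write "1" to the control file (needs root for the real /proc path)."""
    path.write_text("1\n")


if __name__ == "__main__":
    print("bpf_jit_enable =", bpf_jit_state())
```

`bpf_jit_state()` returns None on kernels built without the JIT, where the file does not exist.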

1.6. Wake on WLAN support

Wake on Wireless is a feature that allows the system to enter a low-power state (e.g. ACPI S3 suspend) while the wireless NIC remains active and does various things for the host, for example staying connected to an AP or searching for networks. The 802.11 stack has added support for it.

Code: (commit 1, 2)

1.7. Unprivileged ICMP_ECHO messages

Recommended LWN article: ICMP sockets

This release makes it possible to send ICMP_ECHO (ping) messages and receive the corresponding ICMP_ECHOREPLY messages without any special privileges, similar to what is implemented in Mac OS X. In other words, the patch makes it possible to implement a /bin/ping that needs neither setuid nor CAP_NET_RAW. Initially, this feature was written for Linux 2.4.32, but unfortunately it was never made public. The new functionality is disabled by default, and is enabled at boot by Linux distributions that support it, optionally restricted to a group or a group range.

Code: (commit)
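From userspace, such a ping socket is an ordinary datagram socket opened with the ICMP protocol. The sketch below builds an echo request and sends it that way. The helper names are invented, and whether the socket can be opened depends on the group range configured in the `net.ipv4.ping_group_range` sysctl on the running system.

```python
import socket
import struct

ICMP_ECHO = 8
ICMP_ECHOREPLY = 0


def icmp_checksum(data):
    """Standard Internet checksum (one's complement sum of 16-bit words)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    total = (total >> 16) + (total & 0xFFFF)
    total += total >> 16
    return ~total & 0xFFFF


def build_echo_request(seq, payload=b"ping"):
    """Build an ICMP echo request; the kernel fills in the identifier itself."""
    header = struct.pack("!BBHHH", ICMP_ECHO, 0, 0, 0, seq)
    csum = icmp_checksum(header + payload)
    return struct.pack("!BBHHH", ICMP_ECHO, 0, csum, 0, seq) + payload


def ping_once(host, timeout=1.0):
    """Send one echo request over an unprivileged ICMP socket.

    Raises OSError (typically PermissionError) when the caller's group is
    outside the configured range, and socket.timeout when no reply arrives.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_ICMP) as s:
        s.settimeout(timeout)
        s.sendto(build_echo_request(seq=1), (host, 0))
        reply, _ = s.recvfrom(1024)
        return reply[0] == ICMP_ECHOREPLY


if __name__ == "__main__":
    try:
        print(ping_once("127.0.0.1"))
    except OSError as exc:
        print("ping socket unavailable:", exc)
```

Note that no raw socket is involved: the process never sees or crafts IP headers, which is what makes it safe to offer without privileges.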

1.8. setns() syscall: better namespace handling

Recommended LWN article: Namespace file descriptors

Linux supports different namespaces for many of the resources it handles. For example, lightweight forms of virtualization such as containers or systemd-nspawn show virtualized processes a virtual PID that differs from the real PID. The same thing can be done with the filesystem directory structure, network resources, IPC, etc. The only way to set up different namespace configurations was to pass different flags to the clone() syscall, but that approach did not allow, for example, one process to access another process's namespace. The setns() syscall solves this problem.

Code: (commit 1, 2, 3, 4, 5, 6)
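In practice, a process joins another process's namespace by opening a file under /proc/<pid>/ns/ and passing the descriptor to setns(). A minimal sketch through ctypes follows. The helper names are hypothetical, the libc is assumed to export a setns() wrapper, and the call normally requires CAP_SYS_ADMIN.

```python
import ctypes
import os


def ns_path(pid, ns_name):
    """Path of the namespace file for a process, e.g. /proc/1234/ns/net."""
    return "/proc/%d/ns/%s" % (pid, ns_name)


def join_namespace(pid, ns_name):
    """Move the calling process into one namespace of process `pid`."""
    libc = ctypes.CDLL(None, use_errno=True)
    fd = os.open(ns_path(pid, ns_name), os.O_RDONLY)
    try:
        # Second argument 0 means "accept whatever namespace type fd refers to".
        if libc.setns(fd, 0) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)


if __name__ == "__main__":
    print(ns_path(os.getpid(), "net"))
```

Calling `join_namespace(container_pid, "net")` as root would, for instance, let a management tool run network commands inside a container it did not start.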

1.9. Alarm-timers

Recommended LWN article: Waking systems from suspend

Alarm-timers are a hybrid style of timer, similar to high-resolution timers, but when the system is suspended, the RTC device is set to fire and wake the system when the soonest alarm-timer expires. The alarm-timers concept was inspired by the Android Alarm driver, and the userland interface uses the POSIX clock and timers interface, with two new clockids: CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM.

Code: (commit 1, 2)
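Because the new clocks plug into the regular POSIX clock interface, they can be queried like any other clock. The sketch below only reads them. The numeric ids are the values from the Linux UAPI headers (Python's `time` module does not name them), and `read_alarm_clock` is a made-up helper.

```python
import time

# Clock ids as defined in the Linux UAPI headers.
CLOCK_REALTIME_ALARM = 8
CLOCK_BOOTTIME_ALARM = 9


def read_alarm_clock(clock_id):
    """Return the clock's current value in seconds, or None if unsupported.

    The kernel rejects these clocks when no suitable RTC device is present,
    which is common inside containers and some virtual machines.
    """
    try:
        return time.clock_gettime(clock_id)
    except OSError:
        return None


if __name__ == "__main__":
    print("CLOCK_REALTIME_ALARM:", read_alarm_clock(CLOCK_REALTIME_ALARM))
    print("CLOCK_BOOTTIME_ALARM:", read_alarm_clock(CLOCK_BOOTTIME_ALARM))
```

Arming an actual wakeup goes through timer_create() with one of these clockids and requires the CAP_WAKE_ALARM capability; that part is not shown here.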

2. Driver and Architecture-Specific Changes

All the driver and architecture-specific changes can be found on the Linux_3.0_DriverArch page.

3. VFS

Cache the xattr security drop check for write: benchmarking on btrfs showed that a major scaling bottleneck on large systems is currently the xattr lookup on every write, which causes an additional tree walk, hits some per-filesystem locks and scales quite badly. This is also a problem in ext4, where it hits the global mbcache lock. Caching this check solves the problem (commit)

4. Process Scheduler

Increase SCHED_LOAD_SCALE resolution: With this extra resolution, the scheduler can handle deeper cgroup hierarchies and do better shares distribution and load balancing on larger systems (especially for low-weight task groups) (commit), (commit)

Move the second half of ttwu() to the remote CPU: this avoids having to take rq->lock and do the task enqueue remotely, saving a lot on cacheline transfers. A semaphore benchmark goes from 647,278 worker burns per second to 816,715 (commit)

Next buddy hint on sleep and preempt path: a worst-case benchmark consisting of two tbench client processes with two threads each, running on a single CPU, went from 105.84 MB/sec to 112.42 MB/sec (commit)

5. Memory Management

Make mmu_gather preemptible (commit)

Batch activate_page() calls to reduce zone->lru_lock contention (commit)

tmpfs: implement generic xattr support (commit)

Memory cgroup controller:

- Add memory.numastat API for NUMA statistics (commit)

- Add pagefault count to memcg stats (commit)

- Reclaim memory from nodes in round-robin order (commit)

- Remove the obsolete noswapaccount kernel parameter (commit)

6. Networking

Allow setting the network namespace by fd (commit)

Wireless

- Add the ability to advertise possible interface combinations (commit)

- Add support for scheduled scans (commit)

- Add userspace authentication flag to mesh setup (commit)

- New notification to discover mesh peer candidates (commit)

Allow ethtool to set an interface in loopback mode (commit)

Allow no-cache copy from user on transmit (commit)

ipset: SCTP, UDPLite support added (commit)

sctp: implement socket option SCTP_GET_ASSOC_ID_LIST (commit), implement event notification SCTP_SENDER_DRY_EVENT (commit)

bridge: allow creating bridge devices with netlink (commit), allow creating/deleting fdb entries via netlink (commit)

batman-adv: multi vlan support for bridge loop detection

pkt_sched: QFQ, quick fair queue scheduler (commit)

RDMA: Add netlink infrastructure that enables registration of RDMA clients (commit)

7. File Systems

BLOCK LAYER

- Submit discard bios in batches in blkdev_issue_discard(), which makes discarding data faster (commit)

EXT4

- Enable "punch hole" functionality (recommended LWN article) (commit), (commit)

- Add support for multiple mount protection (commit)

CIFS

- Add support for mounting Windows 2008 DFS shares (commit)

- Convert cifs_writepages to use async writes (commit), (commit)

- Add rwpidforward mount option that switches on a mode in which CIFS forwards the pid of the process that opened a file to any read and write operation (commit)

OCFS2

- SSD trimming support (commit), (commit)

- Support for moving extents (commit), (commit)

NILFS2

- Implement resize ioctl (commit)

XFS

- Add online discard support (commit)

8. Crypto

caam: add support for the Freescale SEC4/CAAM (commit)

padlock: add SHA-1/256 module for VIA Nano (commit)

s390: add System z hardware support for CTR mode (commit), add System z hardware support for GHASH (commit), add System z hardware support for XTS mode (commit)

s5p-sss: add S5PV210 advanced crypto engine support (commit)

9. Virtualization

User-mode Linux: add earlyprintk support (commit), add ucast Ethernet transport (commit)

xen: add blkback support (commit)

10. Security

Allow limiting the capabilities of usermode helpers (commit)

SELinux

- Add /sys/fs/selinux mount point to put selinuxfs (commit)

- Make SELinux cache VFS RCU walks safe (improves VFS performance) (commit)

11. Tracing / Profiling

perf stat: Add -d -d and -d -d -d options to show more CPU events (commit), (commit)

perf stat: Add --sync/-S option (commit)

12. Various Core Changes

rcu: priority boosting for TREE_PREEMPT_RCU (commit)

ulimit: raise the default hard ulimit on the number of files to 4096 (commit)

cgroups

- Remove the cgroup namespace subsystem. It has been replaced by a compatibility flag, "clone_children", where a newly created cgroup copies the parent cgroup's values. Userspace must manually create a cgroup and add a task to the "tasks" file (commit)

- Make 'procs' file writable

kbuild: implement several W= levels (commit)

PM / Hibernate: Add sysfs knob to control the size of memory for drivers (commit)

posix-timers: RCU conversion (commit)

coredump: add support for exe_file in core name